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Behrens,  Richard  Travis  (Ph.D.,  Electrical  and  Computer  Engineering) 
ouhspace  Signal  Processing  in  Structured  Noise 
Thesis  directed  by  Professor  Louis  L.  Scharf 

Linear  signal  models  are  commonly  used  in  digital  signal  processing,  leading  naturally 
to  the  use  of  linear  subspaces  to  separate  signals  from  noise.  A  linear  model  is  often  a  realistic 
model  for  a  signal.  In  other  cases  a  linear  model  represents  a  good  approximation  to  the  signed 
and  is  used  because  of  its  mathematical  convenience. 

Some  common  types  of  noise  can  also  be  dealt  with  by  applying  a  linear  model  to  the 
noise  as  well  as  to  the  signal.  We  describe  noise  that  obeys  a  low  rank  linear  model  as  structured 
noise,  and  derive  several  signal  processing  methods  based  on  a  structured  noise  model. 

Whereas  orthogonal  projection  operators  play  a  key  role  in  the  solution  of  classical 
linear  estimation  and  detection  problems,  the  addition  of  a  structured  noise  term  to  the  model 
leads  to  oblique  projection  operators  in  the  new  solutions.  Because  of  the  importance  of  oblique 
projection  operators,  one  chapter  explores  their  properties. 

Subspace  identification  is  the  determination  of  the  modes  of  a  linear  signal  or  a  struc¬ 
tured  noise  source  based  on  observed  data.  We  consider  the  identification  of  signal  subspaces 
with  no  prior  knowledge  about  the  signal  except  that  it  obeys  a  low  rank  linear  model.  We  then 
consider  signal  subspace  identification  with  the  prior  knowledge  that  the  signal  is  a  superposi¬ 
tion  of  complex  exponentials.  We  extend  these  identification  techniques  to  a  structured  noise 
model  by  considering  the  identification  of  structured  noise  subspaces  with  varying  degrees  of 
prior  knowledge  about  the  signal  and  the  structured  noise.  We  propose  a  method  of  adaptively 
updating  existing  signal  and  noise  subspace  models  based  on  new  data.  And  we  consider  the 
issue  of  order  selection  when  identifying  subspaces. 

Another  category  of  contributions  is  parameter  estimation  with  structured  noise.  Here 
-  assume  that  the  signal  and  structured  noise  subspaces  are  known  or  have  been  identified 
from  observed  data.  We  derive  oblique  projections  for  estimating  signals  with  varying  prior 
knowledge  about  the  parameter  distributions. 


iv 

We  apply  these  results  to  the  decoding  of  complex  number  codes  for  detection  and  cor¬ 
rection  of  impulse  errors.  In  this  example  we  apply  both  subspace  identification  and  parameter 
estimation  techniques. 
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CHAPTER  I 


Introduction 

We  present  here  a  collection  of  advancements  in  the  theory  and  practice  of  digital 
signal  processing,  based  on  the  use  of  linear  subspaces.  All  subspace  signal  processing  techniques 
share  the  common  goal  of  mitigating  the  effects  of  noise.  Usually  they  are  used  in  the  context 
of  some  kind  of  an  estimation  or  detection  problem,  as  in  the  SVD  based  modification  of  linear 
prediction  discovered  by  Tufts  and  Kumaresan  [TuK82]  where  the  coefficients  of  a  whitening 
filter  for  a  given  signal  are  to  be  estimated.  Their  method  takes  advantage  of  the  fact  that 
linear  combinations  of  complex  exponential  signals  will  lie  in  a  subspace  whose  rank  is  equal  to 
the  number  of  different  complex  exponentials  present.  It  follows  that  any  components  of  the 
received  data  that  lie  outside  this  low  rank  subspace  are  due  to  noise.  This  is  a  typical  example 
of  subspace  signal  processing  where  the  signal  of  interest  is  assumed  to  lie  in  a  low  rank  linear 
subspace. 

But  noise  comes  in  many  varieties,  from  the  background  hiss  of  an  analog  magnetic 
audio  tape  to  the  sharp  crackle  of  lightning  striking  a  telephone  wire.  Previous  techniques 
of  noise  suppression  through  subspace  signal  processing  have  been  oriented  toward  the  former 
variety  of  noise.  They  model  the  desired  signal  as  a  vector  that  lies  in  a  low  rank  subspace, 
and  the  noise  as  a  random  vector  that  may  fall  anywhere  in  the  observation  space.  The  next 
level  in  noise  modeling  is  to  allow  correlated  noise  by  applying  a  shaped  probability  density  to 
the  noise  vector.  We  go  a  step  farther  and  allow  total  dependence  of  some  noise  samples  by 
assuming  that  a  significant  component  of  the  noise  lies  in  some  linear  subspace.  We  call  noise 
components  that  lie  in  a  linear  subspace  "structured  noise"  Many  of  the  new  signal  processing 
methods  we  propose  deal  with  structured  noise. 
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The  application  of  a  linear  model  to  noise  is  arguably  just  as  reasonable  as  the  appli¬ 
cation  of  a  linear  model  to  signal.  For  what  we  consider  as  our  desired  signal  in  one  problem 
may  become  interference  in  the  next.  Power  transmission  waves  at  60  Hz  may  be  signal  to 
the  power  engineer.  For  most  everyone  else  they  represent  a  ubiquitous  form  of  structured 
noise.  Lightning  may  cause  impulsive  noise  (large  amplitude  noise  that  affects  only  a  few  data 
samples),  which  is  also  a  form  of  structured  noise. 

Subspace  signal  processing  In  the  structured  noise  model  may  be  divided  into  two  main 
parts.  First  the  signal  and  noise  subspaces  must  be  identified.  Chapters  III  through  V  of  this 
dissertation  deal  with  the  problem  of  subspace  identification.  Once  identified,  the  subspaces 
must  be  used  to  some  advantage  in  solving  estimation  or  detection  problems.  Chapter  VI 
applies  the  subspace  models  to  several  estimation  problems. 

1.1  Overview 

This  chapter  contains  a  general  introduction  and  outline  of  the  dissertation,  a  summary 
of  the  research  contributions  of  this  work,  and  an  introduction  to  linear  modeling  of  signals 
and  noise.  We  also  begin  to  establish  our  mathematical  notation.  Several  types  of  signals  that 
are  well  represented  by  linear  models  are  discussed,  and  it  is  shown  how  a  model  matrix  for  a 
linear  model  is  formed  for  several  cases. 

Chapter  II  covers  some  of  the  specialized  linear  algebra  necessary  for  the  signal  sub¬ 
space  techniques  presented  in  later  chapters.  Particular  emphasis  is  given  to  projection  opera¬ 
tors,  their  properties,  and  how  to  construct  them  for  given  subspaces.  The  distinction  between 
orthogonal  projections  and  oblique  projections  is  emphasized,  and  a  coordinate  transformation 
is  derived  which  relates  the  two. 

In  Chapter  III  we  consider  the  problem  of  identifying  linear  subspaces  from  observed 
data.  The  chapter  begins  with  a  critical  evaluation  of  the  principle  of  Maximum  Likelihood 
(ML),  concluding  that  it  is  most  appropriate  for  sets  of  parameters  that  are  uniformly  dis¬ 
tributed.  or  at  least  not  known  to  be  highly  nonuniform.  We  then  present  identification  tech¬ 
niques  for  both  signal  subspaces  and  noise  subspaces. 
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In  Chapter  IV  we  extend  the  subspace  identification  techniques  of  Chapter  III  to 
the  case  where  we  have  a  prior  model  for  the  signal  which  imposes  structural  constraints 
on  the  subspace  estimates.  Specifically  we  consider  signals  composed  of  complex  exponential 
modes  whose  subspaces  must  therefore  be  spanned  by  Vandermonde  type  matrices  (we  follow 
Demeure  [Dem89]  in  applying  the  term  Vandermonde  to  non-square  matrices  whose  columns 
are  complex  power  series).  We  present  improvements  to  two  existing  algorithms  for  identifying 
such  subspaces.  We  then  extend  one  of  the  algorithms  to  deal  with  the  presence  of  structured 
noise. 

Order  selection,  the  process  of  choosing  the  appropriate  rank  of  a  signal  subspace 
or  structured  noise  subspace,  is  an  important  aspect  of  subspace  identification.  Chapter  V 
addresses  some  problems  in  subspace  order  selection. 

We  consider  a  special  set  of  estimation  problems  in  Chapter  VI.  The  distinguishing 
feature  of  these  problems  is  the  use  of  linear  models  for  both  signal  and  noise  simultaneously. 
In  other  words,  they  are  problems  of  signal  (or  parameter)  estimation  in  structured  noise.  A 
common  thread  in  most  of  the  solutions  is  the  appearance  of  oblique  projection  operators.  The 
subspaces  identified  analytically  or  by  the  techniques  of  Chapters  III  through  V  are  used  to 
determine  oblique  projections  to  be  used  for  signal  processing. 

Chapter  VII  concludes  this  dissertation  with  a  summary  of  what  has  been  accom¬ 
plished,  conclusions  and  limitations,  and  suggestions  for  extending  the  research. 

1.2  Related  Work 

The  problem  of  estimating  the  frequencies  of  multiple  sinusoids  is  viewed  here  as 
a  subspace  identification  problem.  This  problem  has  been  addressed  by  many  researchers, 
with  some  of  the  more  notable  papers  being  published  by  Prony  [Prol795],  Rife  and  Boorstvn 
[RiB76],  Tufts  and  Kumaresan  [TuK32],  Starer  and  Xehorai  [StN88],  and  Kumaresan.  Scharf 
and  Shaw  [KSS86].  The  order  selection  aspect  of  the  subspace  identification  problem  has  been 
addressed  by  Fuchs  [Fuc88],  Tuan  [Tua88],  Kumaresan.  Tufts  and  Scharf  [KTS84],  Wax  and 
Kailath  [WaKSo],  and  Akaike  [Aka74j. 
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Regarding  the  parameter  estimation  problems  of  Chapter  VI,  related  work  has  been 
published  by  Marshall  [Mar84],  [Mar85],  [Mar86],  Wolf  [Wol83],  Kumaresan  [Kum85],  and 
Scharf,  Mathys  and  Behrens  [SMB87]  in  the  context  of  error  correction  codes  and  burst  errors. 
Our  original  presentation  of  the  structured  noise  estimation  problems  addressed  in  Chapter  VI 
is  [BeS88]. 

For  a  treatment  of  the  classical  least  squares  problem  without  linearly  modeled  noise, 
see  Golub  and  Van  Loan  [GVL89]  or  Lawson  and  Hanson  [LaH74]. 

1.3  Contributions 

We  now  summarize  the  original  research  contributions  of  this  dissertation,  indicating, 
where  appropriate,  the  foundational  work  we  have  built  upon.  Specific  contributions  are 

1)  The  emphasis  on  oblique  projection  operators  as  useful  tools  in  signal  processing. 

2)  The  equations  in  Chapter  II  for  construction  of  an  oblique  projection  with  a  specified 
range  and  null  space. 

3)  The  representation  in  Chapter  II  of  an  oblique  projection  as  coordinate  transfor¬ 
mation  plus  orthogonal  projection. 

4)  The  relationship  in  Chapter  II  between  the  singular  values  of  an  oblique  projection 
and  the  principal  angles  between  its  range  and  null  space. 

5)  The  critical  evaluation  in  Chapter  III  of  the  Maximum  Likelihood  principle  and  the 
example  using  a  quadratic  equation  to  illustrate  the  pitfalls  of  blind  application  of  ML. 

6)  The  extensions  in  Chapter  III  of  the  ML  subspace  identification  technique  of  Scharf 
[Sch9 1] .  One  extension  is  to  deal  with  complex  data  and  unknown  noise  variance.  Another  is 
to  deal  with  a  constraint  on  the  identified  subspace.  The  last  and  most  significant  extension  is 
to  deal  with  the  presence  of  structured  noise. 

7)  The  presentation  in  Chapter  III  of  the  heretofore  unpublished  method  of  Steve 
Voran  [[junpublished  notes]  for  using  Total  Least  Squares  to  update  signal  subspace  models,  and 
the  extension  of  that  method  to  allow  simultaneous  updates  of  signal  subspaces  and  structured 


noise  subspaces. 
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8)  The  incorporation  in  Chapter  IV  of  the  technique  of  Starer  and  Nehorai  [StN'88] 
for  enforcing  constraints  on  the  AR  parameters  into  the  algorithm  of  Kumaresan,  Scharf  and 
Shaw  [KSS86]  (the  KiSS  algorithm,  also  called  IQML  [BrM86])  for  finding  the  ML  estimates 
of  those  parameters.  Also  a  corrected  and  clarified  presentation  of  the  KiSS  algorithm  and  a 
discussion  of  some  implementation  issues. 

9)  The  extension  in  Chapter  IV  of  the  derivations  by  Starer  and  Nehorai  [StN88] 
of  the  gradient  and  Hessian  of  the  KiSS  objective  function  to  the  case  of  complex  data  and 
parameters.  Also  the  derivation  of  more  elegant  expressions  for  the  gradient  and  Hessian  leading 
to  a  filtering  interpretation.  The  extension  to  complex  data  and  parameters  is  more  substantial 
than  it  may  first  appear,  because  of  the  complexity  of  the  equations  involved. 

10)  The  extension  in  Chapter  IV  of  the  KiSS  algorithm  to  deal  with  structured  noise. 

11)  The  derivation  in  Chapter  V  of  a  new  order  selection  rule  for  rank  reduction  in  the 
Linear  Statistical  Model.  We  first  presented  this  result  at  the  IEEE  International  Symposium 
on  Information  Theory  in  San  Diego,  January  1990,  [BeS90]. 

12)  The  derivation  in  Chapter  V  of  a  Bayes  hypothesis  test  for  order  selection  in  the 
identification  of  structured  noise  subspaces. 

13)  The  oblique  projection  estimators  in  Chapter  VI  for  signal  estimation  in  the  pres¬ 
ence  of  structured  noise.  We  first  presented  these  results  at  the  Asilomar  Conference  on  Signals. 
Systems  and  Computers,  November  1988,  [BeS88]. 

14)  The  application  in  Chapter  VI  to  decoding  block  codes  over  the  real  and  complex 
number  fields.  We  first  presented  this  result  at  the  Asilomar  Conference  on  Signals,  Systems 
and  Computers,  November  1987,  [SMB87J, 

1.4  The  Linear  Model:  Notation  and  Terminology 

We  represent  a  scalar  by  any  symbol  in  an  italic  font,  such  as  n.  All  vectors  are  column 
vectors  and  are  represented  by  a  symbol  with  an  underbar,  such  as  x .  Contexts  requiring  a  row 
vector  will  be  handled  with  the  transpose  of  a  column  vector.  All  matrices  are  represented  by 
symbols  in  a  bold  font,  and  are  usually  upper  case,  such  as  H.  The  subspace  spanned  by  the 
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columns  of  a  matrix  is  represented  with  angle  brackets  around  the  symbol  for  the  matrix,  such 
as  (H). 

A  superscript  T  is  used  to  indicate  the  transpose  of  a  matrix  or  vector,  such  as 
Hr.  For  complex  matrices  we  must  distinguish  between  the  plain  transpose  and  the  complex 
conjugate  (Uermitian)  transpose.  We  use  superscript  H  for  Hermitian  transpose  and  T  for 
plain  transpose.  The  complex  conjugate  alone  is  represented  by  a  superscript  *  A  circumflex 
over  any  variable  generally  represents  an  estimate  of  that  variable,  such  as  2  for  an  estimate  of 
x. 

Except  where  noted  all  signals  referred  to  in  this  dissertation  are  discrete  time  signals. 
In  the  linear  model  a  signal  is  characterized  as  a  weighted  sum  (linear  combination)  of  certain 
modes.  The  set  of  weights  determines  a  specific  signal  out  of  the  class  of  signals  which  obey 
the  model. 

The  convenience  and  power  of  a  linear  algebraic  framework  apply  naturally  to  vector¬ 
valued  signals.  But  even  scalar-valued  signals  discrete-time  can  be  placed  in  that  framework 
by  considering  finite  length  (windowed)  observations  as  vectors.  Arrange  the  signal  modes  as 
columns  of  a  matrix  H  and  the  mode  weights  as  elements  of  a  vector  £.  The  product  of  the 
mode  matrix  and  the  weight  vector  is  the  signal,  x  =  H£.  This  linear  model  for  the  signal 
places  x  in  a  finite  dimensional  linear  subspace  known  as  the  signal  subspace,  and  spanned  by 
the  columns  of  the  mode  matrix  H.  The  mode  weights  £  of  a  specific  signal  parameterize  its 
position  within  the  signal  subspace. 

Let  £  be  an  n- vector  representing  the  signal  of  interest.  The  linear  model  says  that 


x  =  H£, 


X 

- 

H 

9 

n 

n  x  m 

m 

Of  primary  interest  to  us  is  the  so-called  overdetermined  case,  where  n  >  m  so  that  there  are 
more  observations  than  parameters.  In  this  case,  the  space  (H)  is  the  signal  subspace. 
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1.5  The  Linear  Statistical  Model 

When  additive  random  noise  affects  a  linear  signal,  the  resulting  received  data  vector 
H  obeys  what  has  been  called  the  linear  statistical  model: 

u  =  £  +  a 


(1.2) 


n  n  x  m  m  m 

Here  u  is  an  n-vector  of  random  noise.  This  model  has  found  many  applications  in  digital  signal 
processing,  such  as  those  presented  by  Scharf  [Sch91],  Dunn  [Dun86],  and  Buckley  [Buc87]. 


1.6  The  Structured  Noise  Model 

A  more  appropriate  model  in  many  situations  is  the  following  generalization  of  the 
linear  statistical  model.  This  model  for  the  received  data  is  illustrated  in  Figure  1.1.  The 
signal  parameter  £  sets  initial  conditions,  or  excites,  the  linear  system  H  to  produce  the  signal 
x.  Noise  added  in  the  communication  channel  is  modeled  in  two  parts:  the  unstructured  noise 
v,  and  the  structured  noise  b  that  results  from  an  underlying  process  2  exciting  the  linear 
system  S. 

The  received  data  ^  is  the  sum 

it  =  x  +  b  +  v 


/■  "  s 


. 

‘ 

u 

= 

H 

£ 

+ 

S 

2 

+ 

V 

n  nxmm  n  x  t  t  n 

The  additional  term  accounts  for  what  we  call  structured,  or  low  rank,  noise  that  lies  in  the 

rank  t  subspace  (S).  It  can  be  any  signal  which  obeys  a  linear  model  but  interferes  with  the 

signal  of  primary  interest.  In  some  cases  its  power  may  be  comparable  to  or  even  greater  than 

the  power  of  the  desired  signal,  resulting  in  a  very  poor  signal-to-noise  ratio  when  processed 
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Figure  1.1  Block  diagram  of  the  structured  noise  model. 

according  to  the  ordinary  linear  statistical  model.  The  additional  term  in  the  model  allows 
knowledge  about  the  structure  of  such  noise  to  be  used  to  its  full  advantage  in  processing  the 
received  data. 

The  matrices  H  and  S  are  both  assumed  to  have  full  column  rank.  In  some  of  the 
developments  which  follow  it  is  also  necessary  that  they  be  linearly  independent,  so  that  the 
composite  matrix  [H  S]  also  has  full  column  rank.  It  is  necessary,  but  not  sufficient,  that 

m  +  t  <  n,  (1.4) 

where  m  and  t  are  the  widths  (and  ranks)  of  H  and  S,  and  n  is  the  dimension  of  the  measurement 
space  (length  of  H  and  S).  We  do  not  require  that  H  be  orthogonal  to  S. 

1.7  Motivation  for  the  Model 

The  linear  model  is  quite  versatile  in  terms  of  the  types  of  signals  which  obey  it. 
The  linear  model  includes  the  entire  family  of  ARMA  impulse  responses  such  as  complex 
exponentials,  sinusoids,  damped  sinusoids,  real  exponentials,  and  sums  of  any  of  these.  Impulse 
and  burst  signals  also  fit  the  framework  of  the  linear  model.  Sometimes  the  linear  model  can 
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serve  as  an  acceptable  approximation  to  a  nonlinear  signal.  One  such  example  is  when  a 
nonlinear  signal  is  band  limited.  We  give  some  examples  at  the  end  of  this  chapter  that  show 
how  several  specific  signals  may  be  represented  in  the  linear  model. 

In  signal  processing,  noise  is  usually  treated  as  a  full  rank  process  in  the  measurement 
space.  However,  in  many  situations  it  is  more  advantageous  (or  more  realistic)  to  model  part 
of  the  noise  as  a  process  occurring  in  a  space  of  lower  dimensionality,  which  is  then  mapped 
into  the  measurement  space  by  some  physical  system.  When  the  physical  system  is  a  linear 
map,  the  structured  noise  obeys  a  linear  model.  In  the  measurement  space,  the  resulting  noise 
is  low  rank  and  exhibits  a  structure  dependent  on  the  physical  system.  Thus  we  use  the  terms 
low  rank  noise  and  structured  noise  interchangeably. 

The  structured  noise  model  of  Equation  1.3  applies  the  modeling  versatility  of  the 
linear  model  to  both  the  signal  and  the  noise  simultaneously.  It  is  especially  appropriate  in 
any  environment  containing  several  competing  signals,  each  of  which  constitutes  noise  from  the 
perspective  of  the  others. 

A  classic  example  of  competing  signals  occurs  in  any  multiuser  communication  chan¬ 
nel,  such  aw  the  broadcast  spectrum.  Previous  approaches  to  this  problem  have  often  centered 
around  making  the  signals  of  each  user  orthogonal  to  all  other  users'  signals.  But  there  may 
be  limits  to  the  amount  of  control  a  designer  has  over  the  competing  users.  Orthogonality 
between  the  various  competing  signal  subspaces  is  clearly  not  always  attainable,  and  the  struc¬ 
tured  noise  model  gives  us  a  tool  for  dealing  with  competing  signals  without  the  orthogonality 
requirement. 

1.8  Examples 

Power  line  noise  can  appear  in  received  data  as  a  sinusoid  of  known  frequency  (e.g.  60 
Hz  or  50  Hz).  It  lies  in  a  rank  2  linear  subspace  and  the  two  unknown  parameters  amplitude 
and  phase  determine  the  position  within  that  subspace.  The  Vandermonde  matrix  which  spans 
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this  structured  noise  subspace  is 

(e;2»//f’)  o 
S  =  (ei*T"F)2 

,(ei2T//F)n-l 

where  /  is  the  power  line  frequency  and  F  =  l/T  is  the  sampling  frequency.  The  two  param¬ 
eters  which  multiply  the  columns  of  S  are  not  actually  amplitude  and  phase,  but  are  another 
parameterization  of  the  amplitude  and  phase  information. 

The  character  of  the  problem  changes  somewhat  when  the  frequency  of  the  interfering 
sinusoid  is  unknown.  Frequency  enters  the  equation  nonlinearly,  through  the  Vandermonde 
matrix  S.  Determination  of  frequency  is  thus  equivalent  to  a  subspace  identification  problem 
and  is  treated  in  Chapter  IV. 

For  a  band  limited  signal,  a  linear  model  does  not  necessarily  apply,  but  an  ap¬ 
proximating  linear  model  can  be  constructed  using  discrete  prolate  spheroidal  wave  functions 
(DPSWF’s)  [Sle78].  Begin  by  forming  the  autocorrelation  matrix  R  of  the  band  limited  signal. 
For  a  signal  with  power  spectrum 


(e-;2T//F)l 

(e->2ir//F)2 

(e-j2»//Fy»-l_ 


-{ii 


if  m  <  a 

if  ft  <  lull  <  T, 


S(e}“) 

the  elements  of  the  autocorrelation  matrix  R  are  given  by 

i,  j  =  1 . . .  n. 


(1.6) 


ft  sinft(i  —  j) 
ra  =  —  - 


(1.7) 


t  Omega(i-j)' 

The  eigenvectors  of  R  are  index  limited  DPSWF’s  and  are  the  vectors  used  to  form  the  signal 
subspace.  Choosing  the  eigenvectors  corresponding  to  the  m  largest  eigenvalues  results  in  the 
m-dimensional  signal  subspace  containing  the  greatest  possible  portion  of  the  signal  energy. 

We  have  given  two  examples  of  signals  whose  linear  model  may  be  constructed  from 
theoretical  considerations.  With  that,  we  would  like  to  move  into  the  estimation  of  signal 
subspaces  in  situations  where  theoretical  models  are  insufficient  to  completely  determine  the 
subspace.  But  first  we  must  lay  some  mathematical  groundwork,  and  we  turn  to  that  in  the 


next  chapter. 


CHAPTER  II 


Useful  Mathematical  Results 

In  this  chapter  we  present  some  of  the  specialized  mathematics  used  in  the  remainder 
of  the  dissertation.  The  most  significant  results  presented  in  this  chapter  are  those  involving 
oblique  projections.  These  include  the  oblique  projection  construction  formulas,  the  coordinate 
transformation,  and  the  connection  between  oblique  projections  and  principal  angles  between 
subspaces. 

2.1  Linear  Subspaces  and  Spans 

Our  signal  subspace  processing  algorithms  work  in  the  context  of  a  vector  space  of  n 
complex  elements.  In  most  cases  the  same  results  apply  to  real  n-dimensional  space.  A  set  of 
m  linearly  independent  vectors  in  such  a  vector  space  spans  an  m-dimensional  linear  subspace. 
The  subspace  is  the  collection  of  all  vectors  that  can  be  expressed  as  a  linear  combination  of 
the  m  spanning  vectors.  We  usually  arrange  the  vectors  of  a  span  as  columns  of  a  matrix.  If 
H  is  such  a  matrix,  we  designate  the  subspace  spanned  by  the  columns  of  H  as  (H). 

The  orthogonal  complement  to  a  linear  subspace  (H)  is  the  linear  subspace  consisting 
of  all  vectors  orthogonal  to  (H),  that  is,  all  vectors  orthogonal  to  every  column  of  H.  We 
use  the  symbol  (H)'L  to  represent  the  orthogonal  complement  of  (H).  In  n-space.  if  (H)  is  of 
dimension  m,  then  (H)J‘  is  of  dimension  n  —  m.  We  also  use  the  term  perp-space  to  refer  to 
the  orthogonal  complement  of  (H). 

The  intersection  of  two  linear  subspaces  (H)  and  (S)  is  the  linear  subspace  consist¬ 
ing  of  all  vectors  that  are  contained  in  both  (H)  and  (S).  The  intersection  may  be  trivial, 
containing  only  the  zero-vector,  in  which  case  we  say  that  the  subspaces  are  non-overlapping. 
Non-overlapping  does  not  imply  orthogonality  between  subspaces.  Orthogonality  is  a  stronger 
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condition  and  it  does  imply  a  trivial  intersection.  A  necessary  and  sufficient  condition  for  sub¬ 
spaces  (H)  and  (S)  to  be  non-overlapping  is  that  the  composite  matrix  (H  S]  be  of  full  column 
rank.  This  of  course  requires  the  sum  of  the  subspace  dimensionalities  to  be  less  than  or  equal 
to  n. 


2.2  Vandermonde  and  Toeplitz  Spans 

A  rectangularly  windowed  ARMA  impulse  response  with  m  simple  poles  ci  . . .  zm  lies 
in  an  m-dimensional  linear  subspace  spanned  by  a  matrix  of  the  form: 


H  = 


.0 

*i 

-l 


n  —  1 


.0 

vl 


(2.1) 


Such  a  matrix  is  called  a  Vandermonde  matrix  when  m  —  n  [GVL39].  We  follow  Demeure 
[Dem89]  in  using  the  term  Vandermonde  to  apply  also  to  the  nonsquare  matrix.  Note  that  the 
subspace  depends  only  on  the  AR  parameters,  since  they  alone  determine  the  pole  locations. 
The  position  of  the  ARMA  impulse  response  within  the  subspace  (H)  is  determined  by  the  MA 
parameters. 

Let  ao...am  be  the  \R  parameters,  that  is.  the  coefficients  of  the  monic  (a0  =  1) 
polynomial  whose  roots  are  the  pole  locations  Ci  .  .  .  :m: 


'  =  1  m- 

;  =  0 


(2.2) 


Then  the  Toeplitz  matrix  of  these  coefficients 


A  = 

0  ai  J 

is  orthogonal  to  the  Vandermonde  matrix  H: 


0  1 


^  ^nxn-trjl 


(2.3) 


AWH  =  o. 


(2.-1) 
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This  orthogonality  is  easily  verified  by  application  of  Equation  2.2.  Since  the  rank  of  A  is 
guaranteed  to  be  n  —  m  it  follows  that  the  orthogonal  complement  of  the  Vandermonde  (H)  of 
Equation  2.1  is  the  Toepiitz  (A)  of  Equation  2.3.  That  is,  (H)1  =  (A). 

Vandermonde  and  Toepiitz  matrices  are  not  generally  orthogonal  spans  for  their  sub¬ 
spaces.  Where  it  is  necessary  to  find  an  orthogonal  span  for  a  subspace  defined  by  some 
non-orthogonal  span,  we  use  either  a  QR  decomposition  of  the  spanning  matrix  or  its  SVD. 


2.3  Projection  Operators 

By  the  term  projection  we  mean  a  matrix  that  is  idempotent  (equal  to  its  own  square): 

E2  =  E.  (2.5) 


The  eigenvalues  of  a  projection  are  equal  to  0  or  1.  However,  a  matrix  whose  eigenvalues  are  0 
or  1  is  not  necessarily  a  projection. 

Orthogonal  projections.  Most  mentions  of  projections  in  the  literature  refer  only  to 
orthogjnal  projections,  the  subset  of  idempotent  matrices  for  which  the  null  space  is  orthogonal 
to  the  range.  In  other  words,  an  orthogonal  projection  whose  range  is  (H)  has  null  space  (H)  . 
A  necessary  and  sufficient  condition  for  a  projection  to  be  orthogonal  is  Hermitian  symmetry: 

PH  =  P.  (2.6) 


For  an  orthogonal  projection  Ph  whose  range  is  (H)  and  whose  null  space  is  (A)  =  (H)  .  we 
have 

PhH  =  H, 


PhA  =  0. 


(2.7) 


Oblique  projections.  Projection  matrices  which  are  not  orthogonal  are  referred  to  as 
oblique  projections.  Oblique  projections  are  idempotent  but  not  symmetric.  This  more  general 
class  of  projections  olavs  a  key  role  in  the  structured  noise  problems  of  Chapters  III  through 
VI.  Since  an  oblique  projection  lacks  symmetry,  its  null  space  and  range  are  not  orthogonal. 
For  an  oblique  projection  EH:S  whose  range  is  (H)  and  whose  null  space  is  (S),  we  have 


EH;SH  =  H. 
Eh;sS  =  0 


(2.8) 
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We  use  the  following  notation  for  projection  operators.  Orthogonal  projections  are 
represented  as  P,  usually  with  a  subscript  indicating  the  range.  Oblique  projections  are  rep¬ 
resented  as  E,  usually  with  a  double  subscript  referring  first  to  the  range  and  second  to  the 
null  space. 

Construction  of  Projections.  We  now  give  equations  that  will  allow  projection 
matrices  to  be  built  from  subspace  spans  for  desired  ranges  and  null  spaces.  Other  formulas, 
not  equivalent  to  ours,  for  building  oblique  projections  are  given  in  [KaW89].  Assume  that  H  is 
a  complex  matrix  of  size  n  x  m  having  full  column  rank,  and  likewise  that  S  is  a  complex  matrix 
of  size  n  x  t  having  full  column  rank.  Assume  further  that  (H)  and  (S)  are  non-overlapping, 
which  implies  m  +  t  <  n. 

The  well  known  formula  to  build  an  orthogonal  projection  whose  range  is  (H)  is 

=  (2.9) 

The  orthogonal  projection  whose  range  is  (H)x  is  given  by 


PHx=I-P„. 


(2.10) 


The  last  projection  operator  may  be  obtained  in  another  way  which  is  of  some  use  in  subsequent 
theoretical  analyses 

P„x  =  lim  (1+  rHH*)-1.  (2.11) 

1-1  r—oo 

The  following  proof  uses  the  Sherman- Morrison- Woodbury  matrix  inversion  formula  [GVL89J: 

lim  (I  +  rHHff)_1 

r  —  oo 

=  lim  (l-rH(I+rHffHr‘HH) 


=  lim  j  I—  H(pl  +  Hh H)~ 1  Hh 
=  1  -  H(HffH)-1Hw 


(2.12) 


=  PH- 


To  build  an  oblique  projection  whose  range  is  (H)  and  whose  null  space  contains  (S) 
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take  either  of  the  expressions 


EH:s  =  PH(I  -  S(S"PHxS)-lS"PHx), 

(2-13) 

EHlS  =  H(h"PsxH)-‘hhPsx. 

Any  remainder  of  Euclidean  space,  orthogonal  to  both  (H)  and  (S),  is  also  in  the  null  space  of 
Eh:s-  These  expressions  for  oblique  projections  merit  verification,  since  they  are,  as  far  as  we 
know,  new. 

To  verify  that  the  first  expression  for  EHis  is  idempotent,  consider 

EH;sEHls  =  Ph(I  -  S(SwPHxS)-1S//PHi)PH(I  -  S(S" PniSr^Pnx) 

=  (P HP h  -  PHS(S/fPH.S)-1SHPHxPH)(I  -  S(S"PHxS)-1S/?PHx) 

(2.14) 

=  PH(I  -  S(SflPHxS)-lSffPHx) 


In  the  preceding  sequence  of  steps  we  use  the  fact  that  PH  is  itself  idempotent  and  that 
PhxPh  =  0.  Now  we  check  the  range  and  null  space: 

EH;SH  =  P„(I-S(SwPhxS)-1S//Phx)H 

=  PhH-PhS(s"PhxS)-1s"PhxH  (2.15) 


E„;SS  =  P„(I  -  S(SwPhxS)-'ShPhx)S 
=  PH(S  -  S(S/fPHxS)“1S^PHxS) 
=  Ph(S  -  S) 


Thus  (H)  is  in  the  range  of  EH;s  and  (S)  is  in  the  null  space.  Finally,  if  A  spans  the  perp-space 
to  (H.  S)  then  PHx  A  =  A.  and  PHA  =  0.  and  SH  A  =  0.  so 

E„;SA  =  PH(  I  —  S(S/fPHxS)-'S//PHx)A 

=  P„A  -  PHS(SwPHxS)-1S//PHxA 

(2.17) 

=  0  -  P„S(S//PhxS)"1SwA 
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and  we  see  that  (A)  is  also  in  the  null  space  of  Eh;S-  Since  we  have  accounted  for  all  available 
dimensions  we  have  determined  that  the  range  of  EH;S  is  equal  to  (H)  and  the  null  space  is 
equal  to  (S,  A). 

To  verify  that  the  second  expression  for  EHlS  in  Equation  2.13  is  idempotent,  consider 
EHisEH;s  = 

=  H(H"PgXH)“1H/fPsx  (2.18) 

=  EHis- 

Check  its  range  and  null  space: 

E„iSH  =  H(H*P  .  H)-'H"P  .  H 

(2.19) 

=  H; 

EH;sS  =  H(H"p  .HJ-'h" P,xS 

(2.20) 

=  0; 

EHiSA  =  H(H//P3xH)-1HkPsxA 

=  H(H"PsiH)*‘HffA  (2.21) 

=  0. 

Thus  the  second  expression  is  a  projection  with  the  same  range  and  null  space  as  the  first 
expression.  Therefore  they  are  equal. 

Another  useful  pair  of  identities  follows  from  the  two  expressions  for  EH:s  in  Equation 

2.13: 

E„lS  =Ph(I-Es..h).  (2.22) 

ES;H  =Ps(I-Eh,s)-  (2-23) 

Where  E3;H  is  the  oblique  projection  with  range  (S)  and  null  space  (H.  A). 

Singular  values  of  projections.  It  is  well  known  that  the  singular  values  of  an  or¬ 
thogonal  projection  matrix  are,  like  its  eigenvalues.  0  or  1.  This  is  true  because  for  a  symmetric 
matrix  the  singular  values  are  equal  to  the  absolute  values  of  the  eigenvalues.  Since  the  2-norm 
of  a  matrix  is  equal  to  its  largest  singular  value,  orthogonal  projections  have  unit  2-norm  and 
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will  never  make  a  vector  longer  by  projection: 

IIPjeIIs  <  lulls-  (2.24) 

For  an  oblique  projection  this  is  not  the  case.  We  will  show  in  section  2.6  that  the  singular 
values  of  an  oblique  projection  can  be  0,  1  or  any  value  greater  than  1.  It  follows  that  obliaue 
projections  can  have  a  2-norm  greater  than  unity  and  that  | |Ej j |o  may  be  greater  than 

2.4  A  Three-Way  Resolution  of  Euclidean  Space 

Given  a  subspace  (H)  and  a  subspace  (S),  define  a  new  subspace  (A)  as  the  portion 
of  Euclidean  space  orthogonal  to  both  ( H )  and  (S).  That  is,  (A)  =  (H,  S)x.  Now  any  vector  in 
Euclidean  space  can  be  expressed  uniquely  as  the  sum  of  three  components,  one  each  in  (H), 
(S)  and  (A)  This  resolves  Euclidean  space  into  three  pieces  as  shown  in  Figure  2.1. 


<A> 


Figure  2.1  Three-way  resolution  of  Euclidean  space. 


Corresponding  to  this  geometric  resolution  is  the  algebraic  identity 

I  =  EH.S  +  Es;H  +  ?a-  (2.25) 

A  corollary  to  Equation  2.25  is 


Phs  —  Eh:s  +  ES:„, 


(2.26) 
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where  PHs  is  the  orthogonal  projection  whose  range  is  (H,  S). 

Figure  2.1  is  also  useful  to  show  how  the  oblique  projection  EH;S  works.  We  say  that 
Eh.s  projects  ^  onto  (H)  along  (S,  A).  By  this  we  mean  that  EH;sli  lies  in  (H),  and  that  the 
difference  (I  —  EH;s )lL  lies  in  (S,  A). 

2.5  A  Coordinate  Transformation  for  Oblique  Projection 

In  this  section,  we  characterize  a  general  oblique  projection  operator  as  the  composi¬ 
tion  of  a  coordinate  transformation  and  an  orthogonal  projection,  as  shown  in  Figure  2.2.  The 
required  coordinate  transformation  F  is  derived  to  satisfy 

Eh;s  =  PhT-  (2.27) 


1 

1 

1 

1 

1 

1 

c 

D 

1 

1 

1 

1 

r 

1 

i 

Es;H 


Figure  2.2  A  characterization  of  oblique  projections. 


Assume  we  have  an  oblique  projection  EH.S  whose  range  is  (H)  and  whose  null  space  is 
(S,  A),  where  (A)  is  defined  as  (H,S)X.  The  coordinate  transformation  F  should  rotate  vectors 
in  the  subspace  (S)  to  a  new  subspace  (S'),  while  leaving  vectors  in  (H.  A)  unaffected.  The 
new  subspace  (S')  must  be  orthogonal  to  (H)  to  put  it  in  the  null  space  of  PH.  To  complete 
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Figure  2.3  Building  up  the  coordinate  transformation. 

the  determination  of  (S'),  we  also  choose  it  to  be  orthogonal  to  (A),  resulting  in  the  definition 

(S')  =  (H,A)X.  (2.28) 

It  is  easily  seen  that  (S')  has  the  same  dimensionality  as  (S).  This  characterization  of  the 
desired  coordinate  transformation  leads  immediately  to  the  representation  shown  in  Figure  2.3, 
where  R.  is  the  required  rotation  from  (S)  to  (S').  The  coordinate  transformation  is  given  by 

F  =  E„iS  +  PA  +  RESlH.  (2.29) 


The  transformation  F  is  not  a  rotation.  A  rotation  which  moves  vectors  in  (S)  to  (S')  is  given 
by 


R.  =  Q3/QsH. 


(2.30) 


where  Qs  is  any  orthogonal  span  of  (S)  and  Qs,  is  any  orthogonal  span  of  (S').  Note  that  R 
could  be  any  mapping  from  (S)  to  (S')  and  F  would  still  satisfy  Equation  2.27.  The  rotation 
was  chosen  to  preserve  the  length  of  vectors  in  the  subspace  (S). 
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2.6  Principal  Angles 

Principal  angles  between  subspaces  are  a  generalization  of  the  geometrical  concept 
of  angles  between  lines  and  planes.  Given  two  subspaces  (H)  and  (S)  of  n-dimensional  space 
there  is  a  set  of  angles  formed  between  them.  The  number  of  such  angles  is  equal  to  the 
dimensionality  of  the  lower  rank  subspace. 

Golub  and  Van  Loan  [GVL89]  give  the  following  definition  for  principal  angles: 


a,-  =  arccos(  max  max  u11  v)  =  arccosfu/*  vT, 
ue(H}t>e(S) 


(2-31) 


subject  to 

uHn  —  vHv  =  1, 

uHuj=  0  j  =  1 _ i  —  1  (2.32) 

vHvj  =0  j  —  1 , ....  i  —  1 . 

Note  that  the  definition  is  recursive  in  that  the  vectors  u  and  v  for  the  ith  principal  angle  are 

constrained  to  be  orthogonal  to  all  previous  Uy  and  v_j  respectively.  Golub  and  Van  Loan  also 

show  that  the  principal  angles  may  be  computed  with  the  Singular  Value  Decomposition  as 
follows.  Let  UH  be  an  orthogonal  span  for  (H),  and  Us  an  orthogonal  span  for  (S).  Then  the 
principal  angles  between  (H)  and  (S)  are  given  by 


a,-  =  arccos  A, , 


(2.33) 


where  A,-  is  a  singular  value  of  the  product  UHWUS. 

We  extend  these  results  as  follows.  For  an  oblique  projection  EH.S  formed  from  sub¬ 
space  spans  H  and  S  according  to  Equation  2.13,  the  singular  values  of  the  projection  matrix 
are  directly  related  to  the  principal  angles  between  the  two  subspaces  (H)  and  (S).  Let  the 
singular  values  of  EH;s  be  denoted  by  <7,  and  the  principal  angles  by  a*.  Then 


1 

sin  (a,  )  ’ 


(2.34) 


To  prove  this,  begin  by  noting  that  A,  being  a  singular  value  of  UhHUs  means  that 
-\~  is  an  eigenvalue  of  UH^U3USWUH  =  UHwPsUH.  Substituting  according  to  Equation  2.33 
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cos2  a,- 

is  an  e.v.  of 

UH"P sUh 

=> 

1  —  COS2  Ofj 

is  an  e.v.  of 

I-Uh*PsUh 

=> 

sin2  a,- 

is  an  e.v.  of 

U„"(I-PS)UH 

=> 

i 

sin3  a, 

is  an  e.v.  of 

(Uh^PjxUh)-1. 

Since  eigenvalues  are  invariant  to  an  orthogonal  transformation,  this  also  implies  that 


(2.35) 


—5 —  isane.v.  of  UH(UH/'P,iU„)'1UHff.  (2.36) 

sin  aci  s 

The  matrix  in  Equation  2.36  is  equal  to  EhisEh;sH.  35  can  be  easily  verified  by  using  the 
span  UH  in  the  second  form  of  Equation  2.13  for  EHis-  It  therefore  follows  that  (1/sina,  )  is  a 
singular  value  of  EH;s  and  the  proof  is  complete. 

A  corrolary  to  Equation  2.34  is  that  the  singular  values  <r,-  of  an  oblique  projection 
that  correspond  to  principal  angles  a,  are  in  the  interval  (l,oo).  Because  an  oblique  projection 
is  low  rank,  it  also  has  singular  values  equal  to  zero  that  do  not  correspond  to  principal  angles. 
Thus,  the  singular  values  of  an  oblique  projection  matrix  may  be  0,  1,  and  any  value  greater 
than  1. 


CHAPTER  III 


Subspace  Identification  with  No  Prior  Model 

The  first  task  in  subspace  based  signal  processing  is  to  identify  the  signal  subspace 
and,  if  appropriate,  the  structured  noise  subspace.  Sometimes  these  subspaces  can  be  identified 
from  theoretical  considerations,  as  for  example  when  the  structured  noise  is  60  Hz  power  line 
noise  with  unknown  amplitude  and  phase.  In  other  cases  we  must  resort  to  observed  data  to 
identify  the  subspaces.  Even  then  we  may  or  may  not  have  enough  prior  knowledge  about  the 
signal  to  impose  constraints  on  the  subspace  estimate.  In  this  chapter  we  consider  the  problem 
of  estimating  signal  and  noise  subspaces  without  structural  constraints. 

We  begin  with  a  general  discussion  of  the  Maximum  Likelihood  (ML)  principle  in 
which  we  urge  caution  in  the  application  of  ML  estimators,  especially  in  the  context  of  the 
ML  invariance  principle.  We  then  present  an  algorithm  for  ML  estimation  of  signal  subspaces, 
and  another  algorithm  for  simultaneous  ML  estimation  of  signal  and  noise  subspaces.  The 
chapter  ends  with  an  application  of  Total  Least  Squares  for  updating  signal  and  noise  subspace 
estimates  based  on  new  data.  All  of  the  subspace  identification  algorithms  in  this  chapter  make 
use  of  the  Singular  Value  Decomposition  (SVD). 

3.1  The  Maximum  Likelihood  Principle 

We  use  the  principle  of  Maximum  Likelihood  (ML)  to  derive  several  of  our  subspace 
identification  methods.  In  ML  subspace  identification  a  joint  probability  density  is  assumed  for 
the  observations  u-  This  density  is  a  function  of  the  signal  subspace  which  is  in  turn  a  function 
of  some  set  of  parameters  a.  The  likelihood  function  for  a  given  observation  y_0  is  equal  to 
the  probability  density  for  <j_  evaluated  at  the  observation  ^  and  considered  a  function  of  the 
parameters  a.  It  is  customary  to  simplify  the  likelihood  function  by  dropping  any  constants 
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that  do  not  affect  the  location  of  the  maximum  in  terms  of  the  parameters  a.  The  ML  estimate 
of  a  is  the  value  of  a  that  maximizes  the  likelihood  function.  Since  the  signed  subspace  is  a 
function  of  a,  we  can  apply  the  invariance  property  of  ML  estimation  [Sch91]  to  say  that  we 
have  also  found  the  ML  estimate  of  the  signal  subspace. 

I  must  digress  to  discuss  the  worthiness  (or  unworthiness)  of  the  Maximum  Likelihood 
principle.  In  some  ways  the  ML  principle  is  philosophically  unattractive.  Its  basic  assumption 
is  that  whatever  observation  you  make  must  have  been  a  relatively  likely  observation.  But  this 
need  not  be  the  case — unlikely  realizations  can  and  do  occur,  especially  when  the  variance  is 
large.  A  more  attractive  principle  of  estimation  is  Maximum  A  posteriori  Probability  (MAP).  1 
In  MAP  estimation,  one  chooses  the  most  likely  parameter  values  given  the  observation  and  a 
prior  density  on  the  parameters.  If  we  consider  the  parameters  as  random  variables,  the  ML 
and  MAP  rules  can  be  stated  in  a  parallel  fashion  as 

ML:  max/jdaGiolfl); 

(3.1) 

MAP:  max/ai^Ubto). 

Thus  while  ML  makes  the  observation  likely,  MAP  makes  the  choice  of  parameters  likely. 
Unfortunately  MAP  estimation  requires  the  additional  knowledge  of  the  probability  density  of 
the  parameters.  Since  this  density  is  not  always  known,  we  cannot  always  use  MAP. 

In  spite  of  the  philosophical  oddity  behind  ML  estimation,  ML  estimators  have  some 
desirable  properties  that  make  them  a  good  choice  when  the  parameter  density  is  unknown. 
First  note  that  ML  often  corresponds  to  least  squares  estimation.  More  specifically,  when  the 
observations  consist  of  signal  plus  zero-mean  white  Gaussian  noise  the  ML  estimator  is  the 
same  as  the  least  squares  estimator  wherein  the  parameters  are  chosen  to  minimize  squared 
error  between  the  observation  vector  and  the  mean  vector  (a  function  of  the  parameters).  For 
colored  noise,  the  ML  estimator  corresponds  to  a  weighted  least  squares  solution.  The  follow  ng 
quadratic  form  represents  both  the  least  squares  objective  function  and  the  negative  log  of  the 

1  A  better  name  would  be  Maximum  A  posteriori  Likelihood.  [SchOl]. 
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likelihood  function: 

(l£o  -2»)-  (3.2) 

Here  m  is  the  mean  of  as  a  function  of  the  parameters  a,  and  R  is  the  noise  covariance  matrix 
in  the  ML  problem  and  R-1  is  the  weighting  matrix  in  the  least  squares  problem. 

The  case  for  the  ML  principle  is  further  bolstered  by  the  following  argument.  We 
might  express  the  lack  of  prior  knowledge  about  the  distribution  of  the  parameters  by  assigning 
a  uniform  prior  probability  density  over  the  entire  parameter  domain  D  (assume  for  the  moment 
that  the  domain  is  finite  in  extent).  If  we  do  so,  the  philosophically  attractive  MAP  estimator 
becomes  identical  to  the  ML  estimator.  This  can  be  seen  by  application  of  Bayes’  rule  to  invert 
the  conditional  densities: 

max/aj^a^)  =  max |a)  ■  (3-3) 

ML  and  MAP  are  the  same  when  fa(a)  is  constant  since  the  ratio  on  the  right  hand  side  of 
Equation  3.3  is  then  constant  with  respect  to  the  maximization  over  a. 

What  if  the  parameter  domain  D  is  infinite  in  extent?  Then  a  uniform  density  would 
have  a  value  of  zero  everywhere,  and  the  MAP  estimator  would  be  undefined.  In  this  case  we 
cannot  make  the  preceding  argument  that  ML  corresponds  to  MAP,  but  we  can  argue  that 
it  doesn’t  matter  because  this  case  is  physically  unrealistic.  To  simplify  the  argument  let  us 
assume  that  the  parameter  is  a  real  scalar  a,  and  the  domain  D  is  a  subset  of  the  real  line 
with  the  Lebesgue  measure  of  D  infinite.  Then  for  any  finite  M  >  0,  the  measure  of  the  set 
-4  =  {x  :  |x|  <  M}  is  2M,  which  is  finite.  Thus  the  probability  that  a  €  .4  is  0  under  a  uniform 
density  over  an  infinite  D.  It  is  difficult  to  imagine  a  parameter  in  the  real  world  that  has 
probability  1  of  being  larger  than  every  finite  number  A/.  We  conclude  that  in  all  physically 
meaningful  cases,  the  ML  estimator  corresponds  to  the  MAP  estimator  with  a  uniform  prior 
density  on  the  parameters. 

This  connection  between  ML  and  MAP  serves  as  a  justification  of  ML  as  long  as  we 
have  no  reason  to  reject  a  uniform  prior  density  of  the  parameters.  But  we  should  be  aware 
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that  by  using  ML  we  are  giving  tacit  approval  to  a  uniform  prior.  This  raises  an  interesting 
question  with  regard  to  the  invariance  principle  of  ML  estimation.  The  invariance  principle 
states  that  if  2  is  a  deterministic  function  of  a, 

0.  =  G(a),  (3.4) 


and  if  3  is  the  ML  estimator  of  a,  then  £  =  G(a)  is  the  ML  estimator  of  £  [Sch91).  The 
problem  is  that  for  most  functions  G ,  it  is  inconsistent  to  allow  that  both  &  and  £  are  uniformly 
distributed.  If  one  set  of  parameters  is  uniform  we  can  calculate  a  specific  nonuniform  density 
for  the  other  set.  The  question  is  whether  or  not  the  ML  estimators  3  and  £  obtained  by 
ti:e  invariance  principle  can  be  simultaneously  appropriate.  This  question  is  addressed  in  the 
following  example. 

Example:  Complex  Quadratic  Roots.  Consider  the  real  second  order  polynomial 

equation 


z2  +  aj;  +  a2  =  0. 


(3.5) 


with  a  complex  conjugate  pair  of  roots  and  zn.  Suppose  we  have  made  some  observations 
that  allow  us  to  estimate  the  coefficients  ai  and  ao  and/or  the  roots  :  1  and  *3.  Label  the 
root  with  positive  imaginary  part  by  Z\.  The  estimate  of  the  roots  might  be  expressed  by 
(p  =  |ri I,  9  —  h*!),  or  by  (a  =  Re*i,  d  =  Im*i).  The  estimate  of  *3  is  determined  by  *i.  If 
we  assume  the  roots  lie  inside  the  unit  circle,  corresponding  to  a  stable  causal  system,  we  can 
specify  the  domain  of  each  of  the  three  sets  of  parameters: 


For  ( p.6 )  :  DPi  =  {(p,0)  :  0  <  p  <  1.0  <  9  <  x}, 


13.6) 


For  (a,  J)  :  Daj  =  {(a,.J)  :  a2  +  J2  <  I,.J  >  0}.  (3.7) 

0 

For  (ai.aj)  :  Daia,  =  {(<!]. a3)  :  ^  <  a2  <  1}.  (3.S) 


These  parameter  domains  are  shown  in  Figure  3.1. 


theta 
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Figure  3.1  Parameter  domains  for  complex  quadratic  roots. 

If  one  of  these  sets  of  parameters  is  jointly  uniformly  distributer1  over  its  domain,  we 
can  compute  the  densities  of  the  other  parameterizations  using  a  theorem  in  Papoulis  [Pap*-1]. 
The  densities  in  each  row  of  Tab’**  3.1  and  Figure  3.2  are  related  by  transformations  of  variables, 
each  row  corresponding  to  a  uniform  joint  density  m  one  of  t.ie  sets  of  parameters.  N'ote  that 
when  one  set  of  parameters  is  assumed  uniform,  the  densities  of  the  other  parameterizations 
are  distinctly  nonuniform  and  in  some  cases  are  actually  unbounded. 

As  an  example  of  the  computations  involved  in  creating  Table  3.1.  we  present  the 
derivation  of  the  joint  density  of  ip. 9)  that  corresponds  to  an  assumption  of  uniformitv  of 
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Table  3.1  Density  functions  for  quadratic  root  parameterizations. 


t a i ,  a;).  Assume  that  (ai.a?)  are  uniformly  distributed  in  Da,a,: 

f  |  for  (alt  a?)  t  A.,*, 

/a,.3,(a  i,a;)  =  ^ 


'•0  otherwise. 

The  functional  relationship  between  (ai.a?)  and  (p.  9)  is  one-to-one  and  onto: 


P  =  s/a? 


The  Jacobian  matrix  for  the  transformation  is 


a\  =  —2 p  cos  9 

n 

a  2  =  p" . 


•  do 

-)o  ‘ 

da? 

J4_ 

** 

■  Ja  i 

daj  ■ 

And  the  Jacobian  of  the  transformation  is  the  ..osolute  value  of  the  determinant  of  the  Jacobian 


matrix 


Uniform  in  (rho.theta) 


Uniform  in  (rho.theta) 


Uniform  in  (rho.theta) 


0  0 

f(rho.theta) 


f(aipha.beta) 


Figure  3.2  Density  functions  for  quadratic  root  parameterizations. 
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Hence,  the  joint  density  of  the  radius  and  angle  of  zj  is 

/.,(,.»)  -  {  “  V*™*  f”  M  €  (3.13, 

t  0  otherwise. 

Next  we  compute  the  marginal  densities  for  p  and  9  by  integration  of  the  joint  density. 

Up)  =  r  ,„(,.*)*  =  3 p2,  for  0  <  P  <  1; 

(3.14) 

f»(9)  =  J  fp,a(P^)dp=^  sin0,  for  0  <  9  <  t. 

Of  the  three  choices  in  Table  3.1  for  the  density  of  the  radius  and  angle  of  si,  this  is  perhaps  the 
most  physically  realistic.  We  say  this- because  this  density  favors  frequencies  near  the  Nyquist 
rate  and  damping  coefficients  near  1. 

What  does  this  result  imply  about  frequency  estimation?  If  we  know  absolutely  noth¬ 
ing  about  prior  distributions  we  might  as  well  use  ML.  But  if  we  know  enough  to  say  that  a 
uniform  density  is  a  better  description  of  the  polynomial  coefficients  than  of  the  polar  form 
of  the  root  locations  then  we  should  probably  use  a  MAP  estimator  based  on  the  density  just 
derived  for  p  and  9.  We  will  comment  on  how  this  might  be  applied  in  the  section  on  the  KiSS 
algorithm  in  Chapter  IV. 

In  the  following  subsections  we  derive  some  ML  estimators  of  subspaces.  Their  ap¬ 
propriateness  to  a  given  problem  must  be  assessed  according  to  the  principles  just  discussed. 


3.2  ML  Signal  Subspace  ID  with  No  Prior  Model 

The  first  ML  subspace  estimation  problem  we  consider  is  the  case  where  nothing  is 
known  a  priori  about  the  signal  subspace  except  its  rank  r.  The  dimension  of  the  observation 
space  is  n.  Assume  that  m  observations  of  the  signal  vector  are  available  [m  >  r)  and  that 
thev  obev  the  linear  statistical  model  without  structured  noise: 


U,  =  H 9,  4-  v,  (1  <t  <  m). 


where  €  Cn.  £i  €  Cr.  H  £  C 


Ref  i/,)  :  \(Q,  <7'I) 


1  Ill f  *  f  :  N(Q.  <7"I) 
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Assume  further  that  is  independent  of  vj  for  i  56  j.  We  desire  an  estimate  of  the  span  of  H, 
the  signad  subspace.  The  following  result  is  a  slight  extension  of  t,ne  work  of  Scharf  [Sch91]  in 
that  here  we  take  complex  valued  data  and  we  assume  the  noise  variance  a2  is  unknown  and 
must  be  estimated  from  the  data. 

We  can  parameterize  the  signal  subspace  by  a  set  of  n  —  r  linearly  independent  unit 
vectors  that  are  orthogonal  to  the  signal  subspace.  Define  the  signal  r,  =  H 9t.  Then 

afxt=Q\  J  =  r  +  l,...,n,  (3.16) 

It  is  convenient  to  arrange  the  vectors  into  a  matrix: 

A  =  [ar+1  •••  g^]  sCnxn~r.  (3.17) 

Equation  3.16  then  becomes 

Affr,  =  Q,  (sl,..,,m.  (3.18) 
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We  would  like  to  optimize  £  with  respect  to  The  usual  approach  is  to  set  the 
gradient  of  £  with  respect  to  z,  equal  to  zero.  However,  the  gradient  does  not  exist  in  this  case 
because  z,  is  complex  and  the  complex  conjugate  function  is  not  differentiable  in  the  sense  of 
complex  variables.  All  we  really  need,  though,  is  that  the  gradient  of  £  with  respect  to  the 
real  part  of  xt,  and  the  gradient  of  £  with  respect  to  the  imaginary  part  of  x.T  both  be  zero. 
We  build  a  psuedo  gradient  by  taking  the  gradient  with  respect  to  the  real  part  plus  j  times 
the  gradient  with  respect  to  the  imaginary  part,  and  call  it  the  gradient.  If  we  assume  for 
the  moment  that  ar+1  •  •  fln  and  <r2  are  known,  then  the  gradient  (pseudo  gradient)  of  £  with 
respect  to  r,  is 

VX)£  =  -4(lf,-£«)+AA<.  (3.23) 

To  make  the  gradient  equal  to  zero  we  choose 

xt=Ut-  (T2AA(.  (3.24) 

The  constraints  are  imposed  by  writing 

A^-^AA,)  =Q  (3.25) 


and  solving  for  A( : 


A,  =  -4(A*A)-lA"a,. 


The  corresponding  solution  for  x ,  is  the  ML  estimator 

It  =  (I“  X(AHA)-lXH)Ut 
=  (I  -  Pa)&- 

The  resulting  maximum  value  of  the  log-likelihood  function  is 

L(A,  cr2)  =  -mn  ln(2;r<72)  -  —  Y]  P 

Z  <T  “  *■  ■ 


m 


t=i 


=  -mn  ln(2T<r-)  -  — Tr(PAR), 


where  R  is  the  sample  correlation  matrix: 

R=i^I^  e  c"xn. 


(3.26) 


(3.27) 


(3.28) 


(3.29) 
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This  ML  estimator  fits  the  conditions  given  earlier  for  equality  with  the  least  squares 
estimator.  It  was  derived  under  the  assumption  of  additive  stationary  zero-mean  white  Gaussian 
noise.  The  corresponding  least  squares  problem  is  equivalent  to  finding  the  rank  r  matrix  closest 
in  Frobenius  norm  to  the  given  data  matrix  Y  =  [at  . . .  jjm]  €  Cnxm.  This  problem  was  solved 
by  Eckart  and  Young  [EcY36].  The  solution  is  to  form  the  signal  subspace  estimate  by  taking 
the  r  dominant  left  singular  vectors  of  Y.  This  solution  is  identical  to  the  ML  estimator,  which 
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takes  the  r  dominant  eigenvectors  of  R  =  YYH ,  because  the  eigenvectors  and  eigenvalues  of  R 
are  the  same  as  the  left  singular  vectors  and  squares  of  the  singular  values  of  Y. 

The  last  step  is  to  maximize  the  log-likelihood  with  respect  to  a2  and  obtain  an 
estimate  of  the  noise  variance.  The  maximum  value  of  log-likelihood  from  the  previous  step  is 


L(<r2)  =  -mn  ln(2tr<72)  -  ^  A2.  (3.35) 

i=r+l 

Differentiating  with  respect  to  tr2  gives 

dff2  ~  ~mn2ira2  ^  +  2a*  ^  A,?'  (336) 

i=r  +  l 

Setting  the  derivative  to  zero  and  solving  for  <r2  produces  the  ML  estimator  of  variance 


*2  =  ^  •  (3.37) 

i=r+l 

This  is  almost  what  one  would  expect  for  a  variance  estimator,  except  for  the  normalization  by 
n.  Since  we  are  summing  n  -  r  singular  values  it  would  seem  natural  to  normalize  by  n  -  r. 
There  is  nothing  to  stop  us  from  changing  the  normalization,  and  the  resulting  estimator  might 
be  better,  but  it  would  not  be  the  ML  estimator. 


3.3  Constrained  ML  Signal  Subspace  ID 

We  consider  now  a  variation  of  the  Maximum  Likelihood  signal  subspace  identification 
problem  treated  in  the  preceding  section.  In  this  section  we  impose  the  constraint  that  the 
identified  signal  subspace  be  contained  in  a  given  higher  rank  subspace  (V).  The  given  subspace 
(V)  could,  for  example,  be  a  users  allotted  portion  of  a  Code  Division  Multiple  Access  (CDMA) 
information  channel.  Any  signal  of  significance  to  the  user  would  lie  in  (V),  but  the  user  may 
be  interested  in  identifying  a  lower  rank  signal  subspace. 

While  the  other  constrained  identification  problems  are  dealt  with  in  Chapter  IV.  the 
simplicity  of  this  constraint  and  the  appearance  of  the  SVD  in  the  solution  tie  this  problem  to 
the  unconstrained  identification  problems  of  the  present  chapter. 

Assume  as  before  that  m  observations  of  the  signal  n-vector  are  available  (m  >  r) 


and  that  they  obey  the  linear  statistical  model  without  structured  noise  as  given  in  Equation 
3.15.  We  desire  an  estimate  of  the  signal  subspace  spanned  by  H,  under  the  constraint  that 
<H)  C  (V). 

As  before  we  can  parameterize  the  signal  subspace  by  a  set  of  n-r  linearly  independent 
unit  vectors  that  are  orthogonal  to  the  signal  subspace,  and  collect  the  vectors  a;  into  a  matrix 
A: 

A^r,  =  Q,  t  =  1...  ,,m.  (3.38) 


Let  B  be  a  matrix  that  spans  the  orthogonal  complement  of  (V) .  Now  since  (H)  C  (V) 
it  follows  that  (B)  C  (A).  Our  constraint  therefore  serves  to  predetermine  a  portion  of  the 
subspace  (A),  so  we  can  represent  A  as  the  concatenation  of  a  known  portion  B  and  an  unknown 
portion  A  which  we  may  take  to  be  orthogonal  to  B: 


A  =  [  B  A], 

(3.39) 

Pa  =  Pb+Pa- 

The  Lagrangian  for  constrained  minimization  of  the  negative  log-likelihood  function 
is  the  same  as  in  Equation  3.21, 


1  m  m 

C  =  mn  ln(2x<r2)  +  —  ^(<*(  -  xt)H(Ut  -  xt)  +  A?Ahx,, 
i=i  t=i 


(3.40) 


and  the  development  proceeds  unchanged  through  Equation  3.28: 


L(A,<72)  =  -mn  ln(2xcr2)  -  ^  ^U?PAU, 


t=l 


(3.41) 


=  —mn  ln(2x<72)  -  ^jTr(PAR). 

The  next  step  is  to  maximize  with  respect  to  the  unknown  portion  of  A,  and  here  we 
begin  to  differ  from  the  development  in  the  preceding  section.  The  compressed  negative  log- 
likelihood  function  in  Equation  3.41  may  be  resolved  into  a  fixed  portion  for  B  and  a  variable 
portion  for  A: 
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So  the  problem  reduces  to  finding  A  orthogonal  to  B,  or  equivalently  A  in  (V),  to 


minimize 

L(A,<t2)  ~  -mn  ln(2jr<r2)  -  ^jTr(PAR) 

=  -mn  ln(27r<r2)  -  £^Tr(PARPA). 
But  since  <^A^  C  (V)  we  know  that 

P^Pv  =  Pa, 


(3.43) 


(3.44) 


so  we  can  replace  P^  in  Equation  3.43  by  P^PV  where  Pv  is  the  orthogonal  projection  onto 
(V).  Now  let 

R=PVRPV.  (3.45) 

With  this  definition  of  R  the  log-likelihood  function  becomes 


L( A,  <r2)  =  -mn  ln(2ff<r2)  -  ^  Tl(PARPA).  (3.46) 

The  projected  sample  correlation  matrix  R  will  now  play  the  same  role  in  this  constrained 
subspace  identification  problem  as  R  played  in  the  unconstrained  problem. 

Let  the  projected  sample  covariance  matrix  R  have  the  orthogonal  decomposition 


R=  UA2UH, 


(3.47) 


where  U  is  unitary  and  A2  is  diagonal  and  ordered: 

VHU  =  I 

u  =  [ui...«r+1...uj  er» 

(3.48) 

A2  =  diag(A2...A:+1...A2)  €  R"x" 

A2  >  A;  >  ■■  >  A2. 

Since  R  has  been  projected  onto  (V)  it  will  have  a  set  of  zero  eigenvalues  A,  corresponding  to 
eigenvectors  u,-  that  span  (B),  the  orthogonal  complement  of  (V). 

As  in  the  unconstrained  problem  of  the  last  section,  the  negative  log-likelihood  is 
bounded  by  an  expression  involving  the  sum  of  the  largest  eigenvalues  of  R  and  the  bound  is 
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achieved  when  PA  is  a  projection  onto  the  subspace  {U2)  spanned  by  the  n  —  r  least  dominant 
eigenvectors  of  R.  The  projection  PA  chosen  in  this  way  will  automatically  contain  (B)  in  its 
range  because  the  eigenvalues  corresponding  to  (B)  are  zero. 

In  summary,  the  constrained  ML  estimate  of  the  signal  subspace  H  is  the  space  orthog¬ 
onal  to  A,  which  is  the  space  spanned  by  the  r  most  dominant  eigenvectors  of  the  projected 
sample  correlation  matrix  R  =  PvRPv  =  mHT=i(^>vy.t)(PvU.t)H ■  A  simple  way  to  think 
about  the  solution  is  to  project  all  of  the  received  data  y,  onto  the  given  constraint  subspace 
(V),  and  then  proceed  exactly  as  in  the  unconstrained  problem  by  forming  the  sample  correla¬ 
tion  matrix,  taking  its  SVD,  and  choosing  the  space  spanned  by  the  r  dominant  eigenvectors 
as  the  signal  subspace.  We  can  go  on  to  estimate  exactly  as  in  the  unconstrained  problem 
once  the  signal  subspace  has  been  determined. 

3.4  ML  Signal  and  Noise  Subspace  ID  with  No  Prior  Signal  Model 

Consider  the  problem  of  simultaneously  identifying  the  signal  subspace  and  the  struc¬ 
tured  noise  subspace.  As  before  we  assume  that  nothing  is  known  a  priori  about  the  signal 
subspace  except  its  rank  r.  However,  we  cannot  make  a  similar  assumption  about  the  structured 
noise  subspace  because  there  must  be  some  distinguishing  feature  that  allows  us  to  discrimi¬ 
nate  between  signal  and  structured  noise.  For  the  sake  of  this  development,  we  assume  that 
the  structured  noise  is  impulsive,  having  q  nonzero  elements  in  unknown  positions.  It  is  not 
important  that  q  be  known  a  priori. 

As  before  the  dimension  of  the  observation  space  is  n.  Assume  that  m  observations 
of  the  data  vector  are  available  (m  >  r  +  q)  and  that  they  obey  the  structured  noise  model: 

U,  =  Hi,  +  Sfi,  +  vt  ( 1  <  f  <  m),  (3.49) 

where  €  Cn,  £,  6  C,  H  €  Cnxr, 

£,€€*,  S  €  {0.  l}n*'?. 


Re(i/()  :  N(Q,<72I)  1L  Irnit;,)  :  N’(0.  <r"I) 
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Assume  further  that  j/;  is  independent  of  for  i  ^  j.  We  desire  an  estimate  of  the  signal 
subspace  (H),  and  the  structured  noise  subspace  (S).  Note  the  limitation  that  S  is  the  same 
for  each  of  the  m  observations  of  the  data  vector.  This  would  be  the  situation,  for  example,  in 
a  sensor  array  in  which  some  of  the  sensors  were  subject  to  high  noise  levels  or  had  failed. 

Since  the  number  of  selection  matrices  S  of  rank  q  or  less  is  finite,  we  can  consider  the 
merits  of  each  one  in  turn  and  choose  the  best  one.  Assume  for  now  that  a  candidate  S  has 
been  chosen.  Then  the  derivation  of  the  maximum  likelihood  estimator  of  H  is  am  extension  of 
the  preceding  section.  We  proceed  in  a  parallel  fashion. 

We  can  parameterize  the  signed  subspace  by  a  set  of  n  —  r  linearly  independent  unit 
vectors  a,  that  are  orthogonal  to  the  signal  subspace.  Define  the  signal  x,  =  H£t,  and  the 
structured  noise  6,  =  S^t.  Then 

afx,  =  0;  i  =  r+l,...,n,  (3.50) 

It  is  convenient  to  arrange  the  vectors  a,  into  a  matrix: 

A  =  [aP+1  ...  an]  6C"x"-r.  (3.51) 

Equation  3.50  then  becomes 

Ahx(=Q_,  t  =  (3.52) 


The  mean  of  y,  is  the  signal  plus  the  structured  noise,  x,  +  bt,  and  the  density  of  y, 


is  joint  normal: 


The  log- likelihood  function,  given  itp  •  •  is 


1 

=  — mnln(2?r<72)  -  -  x,  -b,).  (3.54) 

1(7* 

r=i 

We  need  to  maximize  the  log-likelihood  function  under  the  orthogonality  constraints  of 
Equation  3.52,  The  Lagrangian  to  minimize  the  negative  log-likelihood  under  these  constraints 
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is 


m  m 

C  =  mn  ln(2T(T2)  +  t-j  ~  £4  ~  ~  £t  ~  k)  +  Yi  A*£r 


2er2 

tSl 

where  we  have  defined  the  Lagrange  multipliers 

'  \ 


1=1 


A,  = 


'!(r  +  l) 


6  C" 


Atn  J 

If  we  assume  for  the  moment  that  a,+1  -  •  Oni  £1  •  •  •  km  and  cr2  are  known, 
gradient  of  £  with  respect  to  r,  is 

Vr,£  =  -~U/,  -£,-*,)  +  AA(. 


To  make  the  gradient  equal  to  zero  we  choose 


£«  =11,  ~k,  —  <t2AA(. 


The  constraints  are  imposed  by  writing 


A*(a,  -bt  -  a2 AAt)  =  Q 


and  solving  for  A,  : 

A»  =  -^(AHA)-1AH(i[,  -  bt). 

<T* 


The  corresponding  solution  for  xt  is  the  ML  estimator 


£,=(I-A(AffA)-1A")(I£,-6t) 


=  (I-P  a)(u,-&,). 


The  resulting  maximum  value  of  the  log-likelihood  function  is 

1  m 

L( A. <2t . 2m,<r2)  =  -mnln(2Tff2)  -  —  £(&  -6,)"PA(««  -6,) 


m 


t=i 


=  -mn  ln(2T<7‘)  -  — Tr(PAR) 

where  R  is  the  sample  correlation  matrix  with  the  structured  noise  subtracted  out: 

R  =  ^X>. -$.)(*- 4. )*  e€nxn. 


(3.55) 


(3.56) 

then  the 


(3.57) 


(3.58) 


(3.59) 


(3.60) 


(3.61) 


(3.62) 


(3.63) 
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We  have  compressed  the  likelihood  by  maximizing  with  respect  to  z,  so  that  it  is 
no  longer  a  function  of  The  next  step  is  to  maximize  with  respect  to  A.  Let  R  have  the 
orthogonal  decomposition 

R=UA2U/f,  (3.64) 

-  .  _2 

where  U  is  unitary  and  A  is  diagonal  and  ordered: 

~H  ~ 

U  U  =  I 


u  =  [s1---ur+1---5n]  GC"*" 

A2  =  diag(A?...A;+l...A;)  eR" 

a?>a2>...>a2. 


Then  the  log- likelihood  is  bounded  by 


771 


£(A )  =  -mnln(2T<r2)  -  — TTr(PAR) 


<  — mn  ln(2;rcr2) 


2<t2 

m 

2^ 


l'=r+l 


(3.65) 


(3.66) 


for  any  projection  PA  of  rank  (n-r).  This  bound  is  achieved  when  PA  is  a  projection  onto  the 
subspace  spanned  by  the  n  —  r  least  dominant  eigenvectors  of  R: 


A  =  u2  =  [ur+1  ...  un]  ecn*"-r, 
PA  =  U2uf  =  £  5, uf . 


(3.67) 


Next  we  minimize  the  negative  log-likelihood  with  respect  to  the  structured  noise 
kt  =  S&.  This  is  most  easily  accomplished  by  decomposing  the  correlation  matrix  R  onto  the 
orthogonal  complements  (S)  and  (S)1: 


R  =  PSRPS  +  PsiRPsi  +  PsRPsJ.  +  Psj.RPs 
1  m 

=  ~  L  ~  s2,)(Psi(  -  S&)H  +  Ps.u,u?Ps.}  (3.68) 

t=i 

+  PgRP^j.  +  PjiRPj. 

The  first  two  terms  of  the  decomposition  are  nonnegative  definite,  and  the  second  term  is 
constant  with  respect  to  2(.  The  last  two  terms,  the  cross  terms,  are  nilpotent.  The  best  choice 
of  2t  makes  all  terms  but  the  second  term  zero,  which  will  minimize  the  sum  of  the  trailing 
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singular  values  of  R  in  Equation  3.66.  To  make  the  first  term  and  the  cross  terms  zero,  we 


choose 


&  =  (SHS)-lSHUf 


This  implies  that 


bt  =  PS2«, 


'  m 

R  =  (I-PS)  (I-Ps) 

m 

t= 1  J 


=  PsxRPsx. 


The  last  step  is  to  maximize  the  log-likelihood  with  respect  to  a2  and  obtain  an 
estimate  of  the  noise  variance.  The  maximum  value  of  log-likelihood  from  the  previous  step  is 

n 

L{r2)  =  -mnln(2;r<72)  -  ^  ^ 

i=r  +  l  (3.  /  2) 

=  -mnln(2T<r2)  -  ^Tr  (PaP^RP^Pa)  • 

Differentiating  with  respect  to  <r2  gives 

S— 


Setting  the  derivative  to  zero  and  solving  for  cr2  produces  the  ML  estimator  of  variance 

i=r  + 1 

=  ^Tt(PaPsaRPsaPa)  (3-T4) 

=  2^  (P(H.s)-‘-RP(H;s)-L)  ' 

The  algorithm  to  estimate  the  signal  and  structured  noise  subspaces  and  the  noise 
variance  may  be  summarized  as  follows.  First  form  the  sample  correlation  matrix  from  the 
received  data  vectors.  Then,  for  each  possible  structured  noise  matrix  S.  perform  the  remaining 
steps  and  choose  the  S  for  which  the  final  likelihood  function  is  maximized.  If  the  rank  of  S  is 
not  known,  it  will  be  necessary  to  include  an  order  penalty  term  as  an  >rder  selection  rule.  For 
each  candidate  S.  form  the  projected  correlation  matrix  R  =  PS^RP_,*  and  find  its  orthogonal 
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decomposition.  The  r  dominant  singular  vectors  of  R.  span  the  candidate  signal  subspace.  The 
sum  of  the  trailing  n  —  r  singular  values  is  the  negative  log-likelihood  corresponding  to  that 
choice  of  structured  noise  subspace  S. 

It  is  noteworthy  that  this  algorithm  will  always  produce  an  estimate  of  the  signal 
subspace  that  is  orthogonal  to  the  estimate  of  the  structured  noise  subspace.  This  is  not,  ideal 
but  can  be  attributed  to  the  low  level  of  prior  knowledge  about  the  signal  and  structured  noise. 


3-5  Total  Least  Squares  for  Signed  Subspace  Updates 

In  this  section  we  describe  how  Total  Least  Squares  provides  a  natural  way  to  update 
signal  subspace  estimates  based  on  new  data.  The  idea  is  due  to  Steve  Voran  [unpublished 
notes].  In  the  next  section  we  extend  the  idea  to  the  structured  noise  model. 

In  the  theory  of  Total  Least  Squares,  the  prior  model 


U  =  i  +  k. 

x  =  H£,  H£Cnxm 

is  replaced  by  the  posterior  model 


a  =  Piji  +  d-Pi)ii 


(3.75) 


—  £tls  +  £.tls  i  3 . 1  6 ) 

Z-TLS  =  HQ-TLS- 

The  projection  is  chosen  to  minimize  the  total  of  the  sum  of  the  squares  of  the  elements 
of  Utls  =  (I  —  Pi  )U  plus  the  sum  of  the  the  squares  of  the  elements  of  Ah  =  (I  —  Pi)H 
(thus  the  term  Total  Least  Squares).  In  the  posterior  model,  H  is  the  estimated  (corrected 
or  updated)  signal  model,  ?TLS  is  the  estimated  signal  component  and  £71,5  is  the  esr.mated 
noise  component: 

H  =  P ;  H 


Itls  =  —  pi£  +  (3.77) 

£rts  =  (I  -  P\)u  =  l1  ~  P;  k-  (I  -  Pi  lit- 

Tiie  rightmost  equalities  differ  from  the  LS  case  because  the  projection  P;  is  not  matched  to 
die  subspace  (H).  but  is  instead  matched  to  (Hy.  a  kind  of  compromise  between  H  and  the 
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observation  jp  From  £  =  H£  we  deduce  that  the  TLS  estimate  of  £  is 

Q-tls  =  (H  H)  1H  xtls  (3.78) 

The  subspace  (vL^  on  which  the  TLS  solution  is  based  can  be  found  from  the  Singular 
Value  Decomposition  (SVD)  of  the  matrix  formed  by  concatenating  H  and  tt  as  [H  jtj: 

[Hit]  =  USVi/.  (3.79) 


Observe  that  the  TLS  error  is 

e2  —  llftTLslll  +  II^hHf 
=  11(1  ~  Pi)[H 

=  Tr{(I-Pt)[H  jt][H  al"(I-Pi)}  (3-80) 

=  Tr  {(I  -  P1)USV//VEU"(I  -  Pt)} 

=  Tr  {(I  -  P1)US2Ui/(I  -  PO} 

It  is  clear  that  to  minimize  the  TLS  error  e2  in  the  preceding  expression,  we  should  choose  Pj 
to  be  aligned  with  the  m  dominant  left  singular  vectors  in  U  which  correspond  to  the  m  largest 
singular  values  in  E,  or  equivalently  choose  the  rank-one  projection  (I  —  P t )  to  be  aligned  with 
the  one  least  dominant  left  singular  vector.  The  solution  is  unique  provided  that  the  smallest 
singular  value  is  not  repeated.  If  the  singular  values  are  ordered  from  largest  to  smallest  we 
can  writ  the  solution  for  Pi  as 


Pi  =u1u(f, 

where  U  =  [U(  u]. 


(3. SI) 


Thus.  Total  Least  Squares  can  be  used  to  update  a  given  signal  subspace  (H)  each 
time  a  new  data  vector  is  received.  If  we  wish  to  change  the  rate  at  which  the  signal  subspace 
adapts  to  new  data,  we  can  postmutipiv  (H  by  a  diagonal  weighting  matrix  before  taking  its 
SVD. 


3.G  Total  Least  Squares  for  Signal  and  Noise  Subspace  Updates 
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In  this  section  we  extend  the  TLS  subspace  updating  technique  of  the  preceding  section 
to  the  structured  noise  modei.  Suppose  we  have  a  signal  and  structured  noise  model  embodied 
in  given  matrices  H  and  S,  but  we  know  that  the  modei  matrices  may  be  subject  to  error.  This 
would  be  the  case,  for  example,  if  the  model  were  the  result  of  a  subspace  identification  process 
or  if  the  actual  subspaces  were  slowly  varying  in  time.  In  this  case,  the  technique  of  Total  Least 
Squares  (TLS)  allows  both  signal  and  structured  noise  subspaces  to  be  updated  based  on  new 
data. 

We  wish  to  approximately  solve  the  overdetermined  linear  system  given  by: 

a«H£+S£,  (3.82) 


where  we  are  given 


and 


a€Cnxl 
H€C"xm 
S  €  Cnx‘, 

m  +  t  <n. 


That  is,  we  wish  to  find  £  and  g  for  which 


(a  +  £)  =  (H  +  &h)£+  (S  +  As)fi, 


(3.83) 


and  the  total  sum  of  the  squared  perturbations 

£  =  II  [Ah  As  £]  ||> 


(3.84) 


is  minimized. 

Observe  that  the  system  can  be  rewritten  in  a  form  which  makes  it  equivalent  to  the 
original  Total  Least  Squares  (TLS)  problem  solved  by  Golub  and  Van  Loan.  We  write 

a*[HS]  i  .  (3.85) 

L<2J 


and  apply  their  TLS  solution,  but  with  additional  partitioning  of  the  matrices  involved.  That 
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is,  let  the  matrix  C  €  (£nx(m+‘+;)  defined  ^  the  concatenation  of  H,  S,  and  y: 


(3.86) 


C  =  [H  S  u}. 

Write  the  singular  value  decomposition  of  C  as 

C  =  uev* 


where  we  partition  the  matrices  as 

U  = 


E  = 


V  = 


The  singular  values  in  E  are  sorted  with  the  largest  in  the  upper  left  and  the  smallest  in  the 
lower  right  for  the  partitioning.  Again  the  solution  is  unique  only  if  there  is  a  unique  smallest 
singular  value. 

The  TLS  solution  for  the  parameters  is 


[U, 

H2 

n 

m  +  t 

1 

'£i 

al 

m  + 1 

<r2 

1 

m  + 1 

1 

'Vu 

£l2 

m 

V2i 

—22 

t 

L  1131 

”32. 

1. 

m  + 

t 

1 

(3.88) 


$  — - Iti*> 

1*32 

1 

2.  — - 1*22 

V32 


(3.89) 


with  U32  assumed  nonzero. 

The  form  of  the  TLS  solution  given  by  Zoltowski  [Zol87]  allows  the  perturbations  to 
be  easily  characterized.  Define  the  perturbed  variables 

H  =  (H  +  Ah) 

§  =  (S  +  As)  (3.90) 

5  =  (u.+  £)■ 


and  the  orthogonal  projection 


Pi  =  u,u" 


(3.91) 
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Then  by  applying  Zoltowski’s  work,  we  find  that 

H  =  PlH 

§  =  PiS  (3.92) 

2=  Put- 

This  means  that  when  the  original  overdetermined  system  of  equations  is  projected  onto  (Ui), 
the  exact  solution  to  the  projected  system  is  the  TLS  solution  to  the  original  system.  We  have 

it  as  Hi  + 

=>  Pitt  =  PiH£+ PiS£  (3.93) 

=>  u  =  Hi  +  s£. 

The  estimated  signal  is 

x  =  H£.  (3.94) 


Let  be  the  oblique  projection  with  range  and  null  space  containing  as  given  in 
Chapter  II.  Application  of  E  -  -  to  both  sides  of  Equation  3.93  gives  another  way  to  write  the 

H,S 


signal  estimate: 


E-  -u=  E-  -H6  +  E-  -Sd 
H;sa  h;s  -  h;s  ^ 


=  H£ 


(3.95) 


=  x. 

So  x  can  be  obtained  with  an  oblique  projection. 

The  complete  system  for  realizing  the  above  equations  is  shown  in  Figure  3.3.  The 
H  and  S  outputs  of  the  system  represent  updated  versions  of  the  signal  and  noise  subspaces, 
based  on  the  new  data  j£.  Recall  that  we  began  the  TLS  problem  with  the  basic  assumption 
that  there  may  be  errors  in  H  and  S,  so  these  outputs  can  be  viewed  as  an  attempt  to  correct 
those  errors  and  bring  the  model  into  agreement  with  the  observations. 

It  is  clear  that  the  updated  signal  subspace  should  depend  on  the  current  signal 
subspace  and  the  new  data.  But  the  new  signal  subspace  in  the  TLS  update  depends  also  on 
the  structured  noise  subspace  (through  the  SVD).  Conversely,  the  signal  subspace  affects  the 
update  of  the  structured  noise  subspace.  That  the  structured  noise  has  any  influence  on  the 
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H 

S 

X 


Figure  3.3  TLS  with  Structured  Noise. 

new  signal  subspace  seems  undesirable  at  first,  since  the  goal  of  the  structured  noise  model 
is  to  reduce  the  impact  of  the  structured  noise  on  the  signal  estimate.  But  we  must  consider 
that  the  new  data  has  components  of  both  signal  and  structured  noise,  and  that  the  signal 
component  cannot  be  isolated  without  using  the  structured  noise  subspace  (e.g.  through  the 
oblique  projection).  This  situation  justifies  some  interaction  between  signal  and  structured 
noise  subspaces  in  the  update,  although  the  actual  interaction  may  not  correspond  exactly  to 
the  justified  interaction. 

Partially  fixed  subspace  models.  We  now  extend  this  technique  to  deal  with  the 
case  where  only  part  of  the  model  is  subject  to  error.  As  a  basis  for  the  development  assume 
that  the  structured  noise  matrix  S  is  still  uncertain,  but  the  signal  matrix  H  is  known  to  be 

without  error.  In  this  case.  Ah  must  be  zero. 

The  key  to  solving  the  TLS  problem  for  this  case  is  the  following  observation:  The 
perturbations  £  and  the  columns  of  As  must  be  orthogonal  to  (H).  This  is  so  because  ,f  they 
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were  not,  each  could  be  resolved  into  a  component  in  (H)  and  a  component  in  (H)1.  The 
component  in  (H)  could  be  absorbed  into  the  H£  term  with  a  suitable  change  to  £  and  the 
sum  of  squared  perturbations  would  be  reduced  in  so  doing.  Since  the  solution  must  give  the 
minimum  sum  of  squared  perturbations,  the  (H)  components  must  be  zero. 

Assume  the  solution  is 


so  £  is  fixed  during  the  optimization  process.  In  both  equations  the  optimization  occurs  over 
the  same  variables  (<£,  As,  and  6 ),  with  the  same  domain,  and  the  same  constraint  (equality  in 
both  the  projected  and  unprojected  equations).  Since  the  objective  function  is  also  the  same 
(minimization  of  the  sum  of  squared  elements  in  [As  £]),  the  problems  are  equivalent. 


48 


Now  we  find  6  and  As  by  solving  the  TLS  problem  of  the  projected  equation.  Let  Pi 
be  the  orthogonal  projection  whose  range  is  spanned  by  the  first  t  left  singular  vectors  of  [S  2- 
Then  we  have 

<5  =  -(I  -  Pi)u  =  (Pi  -  PHx)a 

(3.100) 


£  =  -(I  -  Pi)a  =  (Pi  -  PHx)y 
As  =  -(I-Pi)S=(Pi  — PHx)S. 


If  we  define  Pt  as 


Pi  =(Pi  +  Ph), 


(3.101) 


then  the  following  relationships  hold: 

H  =s  Pi  H  =  H 
S  =  PiS 

2  =  Pi  U 


(3.102) 


I=EA.-£=EA;iPui. 

The  oblique  projection  is  defined  in  Chapter  II.  It  is  easy  to  see  that  the  matrix  Pi  defined 
above  is  an  orthogonal  projection.  The  main  difference  between  this  problem  and  the  problem 
where  H  was  also  subject  to  error  is  in  how  Pi  is  determined. 

There  is  a  more  general  problem  wherein  there  may  be  some  columns  of  H  known 
exactly  and  some  subject  to  error,  and  likewise  for  S.  This  problem  is  not  really  much  different 
from  the  last.  The  solution  process  is  to  project  the  equation  onto  the  subspace  perpendicular 
to  the  known  parts  of  the  model,  solve  the  projected  TLS  problem  for  Pi  and  add  back  the 
projection  onto  the  known  parts  to  arrive  at  the  projection  Pi  for  which  the  above  relations 


hold. 


CHAPTER  IV 


Subspace  Identification  with  a  Prior  Model 

In  this  chapter  we  present  algorithms  for  identification  of  signal  and  noise  subspaces 
that  are  constrained  to  obey  a  prior  structural  model.  We  consider  signals  composed  of  linear 
combinations  of  complex  exponentials,  and  we  consider  structured  noises  composed  of  linear 
combinations  of  impulses. 

4.1  ML  Signal  Subspace  ID  with  Complex  Exponential  Model 

In  this  section  we  give  a  new  presentation  of  the  “KiSS”  algorithm  of  Kumaresan, 
Scharf  and  Shaw  [KSS86]  (see  also  [EvF73],  [BrM86],  [McC89],  and  [StN88]).  Our  presenta¬ 
tion  includes  a  unified  approach  to  implementing  the  most  commonly  used  constraints  on  the 
parameters,  and  a  discussion  of  when  the  circulant  matrix  approach  of  Kumaresan  can  and 
should  be  used  for  finding  the  necessary  matrix  inverses. 

Let  the  signal  be  a  sum  of  complex  exponential  modes,  a  scalar  valued  time  series: 

rtt 

z(t)  =  (4.1) 

1=1 

Suppose  the  observed  data  consists  of  signal  plus  noise  for  n  consecutive  time  indices: 

y(t)  =  x(t)  +  v(t),  <  =  1.2 . n.  (4.2) 

Arrange  these  time  series  into  vectors  as 

'</(!)]  r*(l)1  r^(l)‘ 

K=  ,  £=  ,  v  =  ;  U,£.,v  SC".  (4.3) 

,!/(n)J  L*(")J  L^). 

The  signal  model  may  be  arranged  in  matrix  form  as 

x  =  H£, 


(4.4) 
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We  have 

a"H  =  0.  (4.9) 

We  wish  to  estimate  the  parameters  a0  ■  •  •  am  that  determine  the  signal  subspace 
through  Equations  4.8  and  4.5.  A  least  squares  approach  to  estimating  these  parameters  was 
proposed  in  [KSS86],  equivalent  to 

min||a-£||2,  (4.10) 

2,2. 

where  2  =  H(a)£.  Bresler  and  Macovski  [BrM86]  then  reported  the  same  algorithm  for  min¬ 
imization  of  the  same  objective  function,  calling  it  Maximum  Likelihood  estimation.  It  is.  of 
course,  both  least  squares  and  ML  when  the  noise  is  additive,  white,  zero-mean,  stationary  and 


Gaussian. 
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It  is  convenient  to  arrange  the  parameters  into  a  vector: 


a  — 


(4.11) 


After  compression  on  9,  the  negative  log-likelihood  function  for  parameters  a<j . . .  am ,  given 
observations  y,  is 


1(a)  =  y?PAU  =  UH  A(A*  A)~lAHy. 


(4.12) 


While  L  is  a  simple  quadratic  form  in  the  data  y,  we  need  to  minimize  it  with  respect  to  the 
parameters  a.  Since  L(a)  is  nonlinear  in  a,  iterative  numerical  approaches  have  been  used 
to  solve  it.  It  is,  in  fact,  a  rational  polynomial  function  in  the  parameters  ao  ...am,  with 
numerator  and  denominator  of  order  2(n  -  m).  This  implies  that  we  could  upper  bound  the 
number  of  local  minima  of  L(a).  One  can  envision  such  a  bound  being  useful  for  locating  the 
global  minimum  with  certainty,  although  we  have  not  proposed  any  such  algorithms. 

Let  us  digress  a  moment  at  this  point  to  discuss  how  the  principles  we  developed  in 
Section  3.1  for  application  of  ML  estimators  apply  to  the  KiSS  problem.  The  objective  function 
of  Equation  4.12  is  appropriate  if  the  parameters  a  are  uniformly  distributed,  or  at  least  if  we 
have  no  reason  to  believe  that  they  are  highly  nonuniform.  But  if.  for  example,  uniformity  is  a 
better  description  of  the  polar  coordinates  of  the  root  locations  then  the  appropriate  estimator  is 
a  MAP  estimator  based  on  the  derived  density  of  the  parameters  a  as  in  the  example  of  Section 
3.1.  The  objective  function  for  this  case  is  obtained  by  adding  to  Equation  4.12  a  term  equal 
to  the  natural  log  of  the  density  function  of  a.  This  kind  of  a  change  in  the  objective  function 
means  that  the  KiSS  algorithm  we  are  about  to  describe  would  not  apply.  Depending  on  the 
nature  of  the  density  function  for  a  it  may  be  possible  to  use  a  KiSS-like  principle  to  derive  a 
minimization  algorithm,  namely  to  hold  part  of  the  expression  in  a  constant  for  each  iteration 
and  optimize  with  respect  to  the  remaining  occurrences  of  a.  Even  if  this  is  not  feasible  it  may 
still  be  possible  to  compute  the  derivatives  needed  to  find  the  minimum  by  Newtons  method  as 
described  later  in  this  chapter.  Finally,  if  even  the  derivatives  are  intractable.  Newton's  method 
can  be  implemented  with  finite  difference  approximations  to  the  derivatives. 
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The  key  to  the  KiSS  algorithm  for  minimizing  the  quadratic  form  of  Equation  4.12  is 
to  rewrite  the  product  of  the  Toeplitz  matrix  A  and  the  vector  y  as  follows: 

A  Hu  =  Y  a,  (4.13) 

where  Y  is  a  Toeplitz  data  matrix  defined  as 

y(m+l)  •••  y(  1) 

y(m  +  1) 

y(n  -  m) 

y(n)  ■■■  y(n-m) 

Equation  4.13  can  be  viewed  as  an  expression  of  the  commutativity  of  convolution.  With  this 
identity,  the  objective  function  becomes 

L(a)  =  /Y"(AHA)-1Ya.  (4.15) 

Then  at  each  iteration,  the  matrix  (A.H  A)"1  is  held  constant  and  the  resulting  quadratic  form 
in  a  is  minimized.  At  the  zeroth  iteration,  (AHA)~l  is  set  equal  to  the  identity  matrix.  At 
each  subsequent  iteration,  it  is  built  from  the  solution  to  the  previous  iteration.  It  is  interesting 
to  note  that  the  solution  at  the  first  iteration  is  equal  to  the  Prony  method. 

Constraints  in  the  KiSS  algorithm.  Prior  knowledge  about  the  signal  may  lead 
us  to  impose  constraints  on  the  locations  of  the  roots  c; .  The  constraints  of  interest  are  spelled 
out  clearly  in  [BrM86].  Starer  and  Nehorai  [StN88]  introduced  a  simple,  yet  powerful,  model  for 
implementing  most  of  these  constraints  in  the  context  of  using  a  Newton  method  to  solve  the 
same  minimization  problem  we  are  considering.  Those  constraints  that  cannot  be  imposed  by 
the  method  of  Starer  and  Nehorai  have  generally  been  considered  too  difficult  to  enforce  at  all 
(see  [BrM86]  and  [KSS86]).  Starer  and  Nehorai  impose  the  tractable  constraints  by  modeling 
the  parameter  vector  a  as  an  affine  transformation  of  a  minimally  dimensioned  new  parameter 


vector  n,: 
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where  T  is  a  constant  rectangular  matrix,  and  £  is  a  constant  vector.  In  our  extension  of  this 
constraint  to  complex  valued  parameter  vectors  we  take  a  to  be  real  valued,  letting  T  map  it 
into  the  appropriate  complex  values. 

We  now  give  explicit  forms  of  T  and  c  to  implement  the  constraints  of  interest.  The 
matrix  Jn  is  the  n  x  n  exchange  matrix  (reverse  permutation): 


Jn  = 


0 

1 


1 

0 


€  R.” 


(4.17) 


and  I„  is  the  n  x  n  identity  matrix.  A  column  vector  of  n  zeros  is  represented  by  0o. 

Constraints  for  nontriviality.  The  “nontriviality”  constraint  requires  ao  =  1  to 
ensure  that  the  corresponding  polynomial  is  monic  and  of  degree  m.  To  implement  the  non- 
triviality  constraint  and  map  real  a  into  complex  a  we  take 

r  qL  i 


so  that  the  relationship  between  a  and  a  may  be  expressed  as 


1  1 

1 

Reaj  +  jlmd! 

on  +  jttm+l 

.  Re  am  +  ;'  Im  am  . 

.  Ofm  +  joc^m  - 

'  Re  ai 

Ream 

Imai 

,1m  am . 


(4.19) 


Constraints  for  modeling  undamped  complex  sinusoids.  When  modeling  un¬ 
damped  complex  sinusiodal  signals,  we  wish  the  roots  of  a  to  lie  on  the  unit  circle.  A  necessary, 
but  not  sufficient,  condition  for  the  roots  of  a  to  lie  on  the  unit  circle  is  complex  conjugate 
symmetry  of  the  coefficients 


(4.20) 


The  sufficient  conditions  for  roots  on  the  unit  circle  cannot  be  imposed  with  this  kind  of 
affine  constraint.  To  require  complex  conjugate  symmetry  of  the  coefficients,  along  with  the 
nontriviality  constraint  Re(ao)  =  1.  we  consider  two  subcases  For  rn  odd  we  take  q  =  im  —  l)/2 
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so  that 


- 

h 

+ 1 

J « 

"" J  <?+l 

'  1  +  ja1Ti  ' 
<*l  +  Jaq+  2 


£  -  ^2? 
1 


a  _  a?  + 

&q  j&m 


ot\  -  jaq+2 
i  -  ;«?+i  . 


For  m  even  we  take  q  =  (m  —  2)/2  and 


flf+i 

i?+i 

jl?+i 

Oj+i 

so  that 


1 + ja,+2 

«i  + jaq+3 


a,  +  jam 
aq+i 

Ocq  J&m 


.  1  -  jaq+2  . 


Imao 


£  —  &2J  +  1 

1 


Re  ai  ’ 

_  Rea?+i 
fan  a  o 


In-.a.+  i, 


Constraints  for  real  data.  If  the  data  is  real,  then  the  coefficients  a  should  be  real. 
For  real  data  with  only  the  nontriviality  constraint  we  take 

rsLi  mi 

T=  ,  ■  £=Lj'  (4'25) 

*  i 

so  that 


a  = 


(4.26) 
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Constraints  for  real  undamped  sinusoids.  For  real  data  with  constraints  of 
nontriviality  and  symmetry, 

a,=am_,,  (4.27) 


we  again  consider  two  subcases.  For  m  odd  we  take  <j  =  (m  —  l)/2  and 


s=  a 


so  that 


For  m  even  we  take  q  =  (m  -  2)/2  and 


£—  !l2?+l 

1 


so  that 


a  -  Qr?+i 


The  KiSS  algorithm.  With  the  affine  constraint  of  Equation  4. 1'5.  the  objective 
function  of  Equation  4.15  becomes 


L  a)  =  (c11  +awvw)Yw;Aia)wAia)|"!YfTa-£j. 


Let 

Q  =  YK[A(ft)wA(ii)]"1  Y>  (“<33) 

and  at  each  iteration  fix  Q  s  Q;  and  minimize 

i<(a)  =  (/  +  a/fTw)Q,(Ta  +  e).  (434) 

This  is  a  quadratic  optimization  problem.  Setting  the  derivative  of  L ,(a)  equal  to  zero. 

V^L/Col)  =  2Tw’Q,Ta  +  2T*Ql£  =  0.  (4.35) 

Since  a  is  real  valued,  Va£(a)  must  be  real  valued,  and  the  new  a  for  iteration  i  is 

A+i  =  -[Re(THQ,T)]Jl  Re(T/fQ,c).  (4  36) 

In  summary,  the  constrained  KiSS  algorithm  attempts  to  minimize  L{ a)  by  the  fol¬ 
lowing  steps: 

1  Build  appropriate  Y,T,c. 

2.  SetQ  =  YHY. 

3.  Repeat  until  convergence: 

3a.  Let  a  -  -[Re(T^QT)]-1  Oe(TwQi) 

3b.  Let  2.  =  To  +  £ 

3c.  Build  A  from  a. 

3d.  Let  Q  =  Y*(A"A)-'Y. 

4.  Stop. 

Convergence  is  considered  to  be  reached  when  no  element  of  a  changes  by  more  than  some 
specified  tolerance  from  one  iteration  to  the  next  We  note  that  while  the  KiSS  algorithm  is 
known  to  be  effective  in  practice,  there  is  no  guarantee  that  it  will  converge  near  the  global 
minimum  of  Ha). 

Our  MAfT.AB™  code  for  the  KiSS  algorithm  is  given  in  Appendix  A 
Example:  Second  order  KiSS  objective  function.  We  may  gain  some  in?ight 


into  the  nature  of  the  KiSS  objective  function  of  Equation  4.32  by  examining  it  for  a  simple  case. 
When  a  consists  of  two  real  parameters  we  can  generate  a  3-dimensional  plot  of  the  objective 
function  versus  ai  'nd  012 •  Such  a  plot  is  shown  in  Figure  411,  for  a  data  set  containing  25 
samples  of  two  complex  exponentials  and  a  second  order  model  constrained  to  have  complex 
conjugate  symmetry  (this  signal  is  the  one  used  in  [TuK82]).  You  can  see  some  tendencies 
toward  symmetry  about  the  global  minimum,  as  well  as  the  appearance  of  several  local  minima 
both  on  the  “plateau”  and  in  the  “canyon”.  The  “ridge”  surrounding  the  global  minimum 
would  tend  to  foil  descent  based  algorithms  from  most  starting  points.  The  KiSS  algorithm, 
however,  is  not  descent  based,  and  usually  converges  near  the  global  minimum  except  below 
the  threshold  SNR  discussed  later  in  the  simulations  section. 
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Figure  4.2  shows  that  the  addition  of  white  Gaussian  noise  up  to  an  input  SNR  of  10 
dB  (as  defined  in  the  following  simulations  section)  does  not  change  the  basic  characteristics  of 
the  KiSS  objective  function  but  does  make  the  “plateau”  rougher.  It  also  perturbs  the  location 
of  the  global  minimum. 


Figure  4.2  KiSS  Objective  Function  at  10  dB  SNR. 

KiSS  implementation  issues.  An  issue  involved  in  efficiently  implementing  the 
KiSS  algorithm  is  inversion  of  the  matrix 

G  =  (AWA).  (4.37) 

Kumaresan.  Scharf  and  Shaw  [KSS86]  presented  an  efficient  algorithm  for  this  inversion  based 
on  properties  of  circulant  matrices.  While  their  algorithm  is  often  a  good  idea,  it  has  some  lim- 
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Then  we  can  make  G  into  a  circulant  matrix  C  by  adding  to  it  the  product  UV^: 

’9o  9 1  •  ■  ■  9m  0  •  •  ■  0  g'm  ■■■  g\  ' 

g’  go 

''■9m 

9  m  0 

C  =  G  -  UVH  =  0  (4.42) 

0 

0  ’  9m 

9m 

9\ 

-9\  ■■■  9m  o  ■  ■  •  0  g'm  ■■■  g0 


A  circulant  matrix  is  computationally  easy  to  invert  using  the  discrete  Fourier  transform 
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[Dav79].  Let  Cj  be  the  sequence  defined  by  the  first  row  of  C,  indexed  from  i  =  0  to  i  =  n—m—l. 
Let  A/  be  the  coefficients  of  the  inverse  discrete  Fourier  transform  of  the  sequence  c,-.  Because 
of  the  symmetry  of  Ci  we  can  use  the  discrete  cosine  transform: 


n—m—l  n 

,  2ml 

A/  =  >  ^  cos  — — 

1=0  71 


forO</<n  —  m  —  1. 


m 


(4.43) 


To  compute  the  first  row  of  C-1  (again  indexed  from  zero)  we  take  the  forward  discrete  Fourier 
transform  of  the  sequence  defined  by  the  reciprocals  of  A/: 

(4.44) 


1  n~m~1  /  2-rlk  \ 

(C_1)o  *  =  -  Y'  A,-1exp(-; — : - j  for  0  <  k  <  n  -  m  -  1. 

n  ~ m  i^o  V  n-mj 


The  rest  of  C-1  is  obtained  by  circular  shifts  of  the  first  row  to  create  a  circulant  matrix.  Once 
C_1  has  been  computed,  we  can  use  the  Woodbury  identity  [GVL89]  to  obtain  G_1  with  only 
an  inverse  of  size  2m: 


G-1  =  (C  -  UV*)-1  =  C'1  +C_1U(I2m  -  V//C'1U)_1V"C*1.  (4.45) 


The  first  question  about  the  circulant  matrix  technique  for  inverting  G  is  when  it  is 
more  efficient  than  general  inversion  algorithms.  General  matrix  inversion  techniques  require 
order  (n  —  m)3  operations,  while  the  circulant  technique  reduces  it  to  order  (n  —  m )  ln(n  —  m) 
(assuming  an  FFT  algorithm)  for  inverting  C.  plus  order  (2m)3  for  the  smaller  inverse.  If  n 
is  large  with  respect  to  m,  this  is  an  improvement  in  efficiency.  Preliminary  tests  conducted 
with  MATLAB™  suggest  that  the  circulant  technique  pays  off  approximately  when  n  >  6m. 
Inversion  of  G  could  instead  be  accomplished  by  a  variation  of  the  Levinson  algorithm,  taking 
advantage  of  the  Toeplitz  structure  to  invert  G  in  order  (n  —  mj2  operations.  Hence,  the 
circulant  technique  is  likely  to  be  the  most  efficient  in  even  fewer  cases. 

The  other  limitation  of  the  circulant  technique  is  that  the  circulant  matrix  C  may  be 
singular,  e.en  when  G  is  nonsingular.  In  this  case  the  algorithm  fails.  This  problem  is  made 
worse  by  the  fact  that  certain  conditions  guarantee  singularity  of  C.  These  conditions  are  that 
a  be  real  and  symmetric  (as  for  real  undamped  sinusoids),  and  both  m  and  n  be  odd.  The 
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proof  is  to  construct  C  for  this  case  and  show  that 

Ci  =  Q 


(4.46) 


when  x  is  given  by 

r  l 

-l 

X  — 

1 

.-1 


(4.47) 


To  prove  Equation  4.46,  first  write  the  elements  of  C  as  linear  combinations  of  the  products 
ctiOj,  where  a  is  the  parameter  vector  corresponding  to  the  constraints  of  of  Equation  4.28. 
A  given  row  of  Equation  4.46  can  be  demonstrated  true  by  collecting  coefficients  of  a,  otj  into 
a  matrix  indexed  by  i  and  j.  We  have  not  included  the  details  of  the  proof,  as  they  are  not 
instructive. 

While  singularity  of  C  remains  a  real  danger,  the  impact  of  the  guaranteed  singularity 
case  is  less  than  it  first  seems.  Real  symmetry  of  a  occurs  by  constraint  when  we  are  modeling 
real  sinusoids,  but  in  this  case  m  is  unlikely  to  be  odd  since  each  real  sinusoid  is  rank  2.  Hence, 
it  is  less  likely  that  the  conditions  for  guaranteed  singularity  will  all  be  met  simultaneously. 


4.2  A  Newton  Method  for  Complex  Exponential  Subspace  ID 

The  KiSS  algorithm  was  originally  presented  with  a  “phase  2”  in  which  the  iteration 
was  modified  to  drive  the  gradient  to  zero  in  order  to  locate  the  minimum  of  the  objective 
function  exactly.  Phase  2  was  first  described  in  [EvF73]  for  real  data  and  without  constraints. 
But  [KSS86],  while  they  state  that  they  used  phase  2,  do  not  tell  how  phase  2  was  extended 
to  complex  data  with  constraints.  Bresler  and  Macovski  [Br.M86]  simply  dropped  phase  2, 
claiming  (perhaps  rightfully)  that  it  did  not  contribute  significantly  to  performance. 

A  Newton  method  makes  a  good  alternative  to  the  phase  2  of  Evans  and  Fishl  because 
of  its  quadratic  convergence  behavior.  Starer  and  Nehorai  [StN88]  derived  expressions  for  the 
gradient  and  Hessian  of  the  constrained  KiSS  objective  function  of  Equation  4.32  for  the  case 
of  real  data  and  parameters.  In  the  following  we  generalize  their  work  by  deriving  similar 
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expressions  for  the  case  of  complex  data  and  parameters.  For  implementation  of  the  Newton 
method,  we  use  and  recommend  the  modifications  described  by  Dennis  and  Schnabel  [DeS83] 
to  improve  convergence  from  poor  starting  points. 

The  following  derivations  of  the  gradient  and  Hessian  of  L (&)  are  rather  involved, 
although  the  end  results  are  pleasingly  simple.  The  basic  methods  we  have  followed  are  those 
of  Magnus  and  Neudecker  in  [MaN88].  Their  formulas  are  intended  to  apply  to  real  matrices, 
but  many  can  be  generalized  to  complex  matrices.  The  derivatives  we  seek  are  of  real  valued 
likelihood  Z,(&)  with  respect  to  read  valued  parameters  &,  so  that  only  the  intermediate  results 
are  complex  valued.  We  use  the  notation 

Xr  :  Transpose  of  a  matrix, 

X’  :  Complex  conjugate, 

XH  :  Complex  conjugate  (Hermitian)  transpose, 

X#  :  Moore-Penrose  inverse,  =  (XffX)_1Xw,  (4.48) 

X  ®  Y  :  Kronecker  product  of  matrices, 
vec(X)  :  Vectorization  of  a  matrix, 


dX  :  Differential  of  a  matrix. 

We  also  use  three  types  of  permutation  matrices:  the  reverse  permutation  matrix  J„  defined  in 
Equation  4.17,  the  circular  down  shift  matrix  Z„  defined  as 


Z„ 


(4.49) 


and  the  commutation  matrix  Km,„  defined  to  satisfy 


Km.„  vec(  A)  =  vec(Ar),  for  A  €  Cmxn 


(4.50) 


We  have  found  the  following  identities  from  Magnus  and  Neudecker  especially  useful. 
They  are  given  with  their  original  equation  numbers  at  left. 


2.2.4 


(A  :3  B)(C  3  D)  =  AC  *3  BD, 


if  AC  and  BD  exist. 


Mol) 
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2.4.5  vec(ABC)  =  (CT  ®  A)  vec  B,  (4.52) 

3.7.4  Kp,m(A®  B)  =  (B®  A)K?,„,  for  A  €  <Cmxr\  B  6  Cpxq ,  (4.53) 

9.13.17  d(X_1)  =  -X-1(<fX)X“1.  (4.54) 

Magnus  and  Neudecker's  identification  theorems  for  first  and  second  derivatives  are  also  quite 
valuable. 


Gradient.  Begin  by  letting  q  be  the  number  of  element?  in  a,  and  let  p  =  n  —  m  be 
the  dimension  of  the  perp-space  (A): 

q  6  R?  and  A  €  Cnxp.  (4.55) 

We  may  find  an  expression  for  dvec(A),  by  first  noting  that 

am  =  (T&+C)*  =  T*a  +  c*.  (4.56) 

Then  vve  can  express  the  mapping  from  a  (and  a)  to  A  as 


vec(A)  = 


The  differential  is 


’  In  1 

■  In  ' 

Zn 

Im+l 

• 

a  = 

Zn 

,  Op-ixm+1 

-zrJj 

-zr1- 

•Im+1 

[_  Op—  lxm+1 


T'o  + 


'm+t 


Op- 


p—  lxm  +  l 


dvec(A)  = 


In 

Zn 


3rn-f  1 

L  Op-  1  x  m+1 


T  'da. 


(4.57) 


(4.58) 


izr1. 

Next,  we  need  the  differential  of  AH  A: 

d  vec(AHA)  =  vec  d(  AH  A)  =  vec[AHd(A)  +  d(Aff)A) 

=  vec(AHdA)  +  Kp,p  vec(AH dA)’ , 


(4.59) 


where  Kp  p  is  a  commutation  matrix  [MaN88].  We  may  apply  Equation  4.52.  and  use  the  result 
of  Equation  4.58  to  obtain 

dvec(A*A)  =(Ip  3  AH)vec(dA)  +  Kp.p(Ip  3  Ar)  vec(dA)* 


=d p  3  Atf) 


r  In 
Zn 

zp- 


Jm+1  | 

Op  —  1  x  m-*.  i  j 


(4.60) 


Kp  „(Ip  3  Ar) 


In  1 
Zn 

zr 


1.  Op  -  1  X  m  +  ] 


T  da. 
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obtain 


Next,  we  need  the  differential  of  Y"(A"  A)-1Y.  We  use  Equations  4.54  and  4.52  to 


d{  YH(AHA)~lY)  =  -Ytf(A//A)~1d(AsA)(AHA)-1Y 


vecd[YH(AHA)"1Y]  =  -  (YT[(A* A)"1]*  ®  YH(AHA)~1)  vecd(AHA). 
Substituting  Equation  4.60  into  Equation  4.61  and  applying  Equations  4.53  and  4.51  gives 


vecd(Y/f(A/fA)_1Y]  =  {Yt[(Ah A)-1]*  ®  YH A#) 


Op—  lxm+l 


I  Jm+l 


Km+i,m+1  (YH(AHA)“l  ®  Yr(A#)') 


I  Jm* 
[Op-ix 


m+1  T  4. 

lxm+l  J 


Finally,  the  differential  of  L(a)  can  be  written  down  using  the  product  rule 
dL(a)  =  d(aHYH(AHA)~lYa) 


=  d(a)HYH(AHA)~1Ya  +  <j"  y  ■  (A"  A)-1  Yd(a)  +  aHd[YH(AH  A)-1 Y]a. 

The  first  two  terms  are  complex  conjugates,  so  they  may  be  combined  as  twice  the  real  part 
of  one  of  them.  Also  each  term  in  this  equation  is  a  scalar,  so  we  can  vectorize  the  third  term 
without  changing  anything.  This  allows  us  to  apply  Equation  4.52  and  put  the  third  term  in  a 
form  where  we  can  substitute  our  result  from  Equation  4.62.  Noting  also  that  da  =  T da,  we 
obtain 

dL( a)  =  2Re[aHYH(AHA)-lYT]da  - 


(aT  ®  aH)\  (Yr[(AHA)-1]*  ®  Y*A#) 


J  m+ 1  j  rp*  + 

Op—  1  xm  +  1  J 


Km+l.m+l  (YH(AHA)~l2YT(A*) 


T  >dct. 


Since  da  is  entirely  factored  out  to  the  right  of  this  expression,  we  can  identify  the  derivative 
of  L  with  respect  to  a  as  the  row  vector  that  remains  when  da  is  dropped  from  Equation  4.64 
(later  we  will  switch  to  the  gradient,  which  is  the  transpose  of  the  derivative).  In  the  next 


sequence  of  steps,  .ve  simplify  the  expression  for  the  derivative.  Applying  Equations  4.51  and 
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4.53  gives  us  this  expression  for  the  derivative: 

DaL(a)  =  2  Re[aH YH(AHA)_1  YT]  - 

‘  In  ' 

(arYr((AKA)-1]*  ®  aHYHA#)  Z.n  n  Jm+1  T*  - 

;  ixm+l  a 

.ZP_1J  (4.65) 

■  In  ■ 

(aHYH{AHA)-'®aTYT(A*)m)  Z n  Jm+1  T. 

.  ®p  —  1  xm+1  , 

.z 

Now  let  u  denote  the  following  function  of  the  measurement  t/: 

u  =  (AHA)~1Ya  =  A#u  €  €p.  (4.66) 

This  definition  of  u  is  obviously  convenient  for  simplifying  the  derivative  expression,  but  it 
also  has  a  physical  interpretation.  It  contains  just  the  information  necessary  to  reproduce  the 
residual,  defined  as  the  projection  of  the  data  y.  onto  the  orthogonal  complement  of  the  signal 
subspace: 

2  =  Pa  it-  (4.67) 

If  the  sequence  u  is  presented  at  the  input  of  an  MA  filter  with  coefficients  a  then  the  output 
is  the  residual  sequence: 

a  *  u  =  Au  =  PA^  =  V.  (4.68) 


Here  *  means  convolution.  Since  u  is  a  minimum  dimensional  coding  of  the  residual,  it  corre¬ 
sponds  to  the  syndrome  in  coding  theory  (see  [SMB87]). 

Proceeding  with  the  simplification  of  Equation  4.65  we  note  that  again  two  terms  are 


complex  conjugates  and  can  be  combined.  Thus 


DaL(a)  =  2  Re  {  u"YT  -  (uH  S  uT AT) 


^p-lxm  +  l 


The  last  remaining  Kronecker  product  can  be  written  out  and  simplified: 

r  In  1  r  In  1 


l/3!!rAT)  Zn  =  [  u'urAr  iuutAt  ■■■  u^A'.l  ^ 


Lzp-1  .1 

—  •  •  •  -f*  A 1  Zpn  *  ) 


=  uT\T  (I nUj  +  Zn Uo  +  Z r-lump) 
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So  the  expression  for  the  derivative  becomes 


DaL(a)  =  2  Re  j /YT  -  uTAT  (lnu-  +  Znu’2  +  ■  ■  ■  +  Zpn~lup) 


Jm  +  l 


L  Op-  1  xm  +  1  J 

Now  define  the  Toeplitz  syndrome  matrix  U  from  the  elements  u,-  of  the  syndrome  in 
the  same  manner  as  A  is  defined  from  a,: 

0 


T  .  (4.71) 


U  = 


€  C 


n  x  m+1 


This  syndrome  matrix  satisfies  the  identity 


(4.72) 


(i„uj  +  znu2  +  --+zr1u;) 


•lm  +  1 

Op-lxm+1 


=  JnU, 


(4.73) 


allowing  further  simplification  of  the  derivative: 


DaL{a)  =  2 Re  {/YT  -  urArJnUT} 


(4.74) 


It  can  be  shown  that 


urArJnU  =  uHJ?ATU-Jm+1. 


(4.75) 


so  the  derivative  can  be  written 

DaL(a)  =  2  Re  { u*YT  -  u" JpATU‘ Jm+iT } 


(4.76) 


=  2  Re  {u"(Y  -  JpArU'Jnl*1)T} 

The  gradient  is  the  transpose  of  the  derivative  (under  the  real-part  operator  we  can  take  the 
Hermitian  transpose): 


a(a)  =  VaL(a)  =  2Re{T*(Yw  -  Jm+1  UrA*Jp  )u }  (4.77) 


This  expression  is  a  generalization  of  that  of  Starer  and  Nehorai  'StNsSl.  The  gener¬ 
alization  is  from  real  data  and  parameters  to  complex  data  and  parameters.  The  real  case  of 
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our  result  agrees  with  their  result,  taking  into  account  that  we  have  made  a  slightly  different 
definition  of  U.  But  we  will  now  proceed  to  derive  a  more  elegant  expression  for  the  gradient, 
taking  advantage  of  some  fascinating  identities. 

First  note  that  JpATU’Jm+i  is  a  Toeplitz  matrix  of  the  same  form  as  Y.  In  fact,  its 
elements  are  the  residuals  V  defined  in  Equation  4.67!  That  is,  the  residual  matrix 


N  =  JpArU’Jm+i 


is  built  from  the  elements  2  >n  exactly  the  same  way  as  Y  is  built  from  u  in  Equation  4.14. 
Corresponding  to  x  =  ^  —  V,  let  us  define  an  estimated  signal  matrix 


X  =  Y  -  N. 


Now  the  expression  for  the  gradient  becomes 

a(fi)  =  VaL(a)  =  2  Re{TH(YH  -  N*)u} 

(4. SO) 

=  2Re{T*x\}. 

So  the  gradient  is  the  convolution  of  the  signal  estimate  J  with  the  syndrome  u,  transformed  by 
the  constraint  matrix  TH .  This  filtering  interpretation  is  shown  in  Figure  4.3.  At  a  maximum 
of  L(a)  the  gradient  is  zero.  For  the  gradient  to  be  zero  requires  orthogonality  between  the 
signal  estimate  and  the  syndrome.  This  orthogonality  can  be  enforced  in  the  time  domain  or 
in  the  frequency  domain,  using  the  DTFT.  This  is  another  occurrence  of  the  "orthogonality 
principle"  that  the  error  is  orthogonal  to  the  signal  estimate  in  least  squares  problems. 

Hessian.  To  implement  a  Newton  algorithm  for  minimizing  L{a)  we  also  need  the 
second  derivative  matrix,  the  Hessian.  Our  final  expression  for  the  gradient  implies  that  the 


first  differential  is 


Thus  the  second  differential  is 


dL(a)  =  daT 2  Re{THX  u}. 


d~ L( a)  =  dar2  Re{TwX  dn  +  T H d(X)-"  n) 


=  dar2Re{THX  Ju~tuT  5  T,f)Kr  m  +  ;  vec./i'X)'} 
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invars^  MA 

filtar  filter 


Figure  4.3  Filtering  Interpretation  of  the  KiSS  Gradient. 

So  the  challenges  are  to  find  the  differentials  of  the  syndrome  and  the  estimated  signal  matrix. 
Beginning  with  the  syndrome  we  have 


u  =  (AH\rlYa, 


so  the  differential  of  the  syndrome  is 

du  -  (AwA)-1YTda-^d[fAwAr']Ya 

=  f  AwA)_1YT<i«  -  ( Aw  A.i“ 1  Jl  A w  A  if  A‘v  A  i“ 1  Yo  <  4  S3  t 

=  !a"A)'‘YT4  -  (ut  G  lA"Af:i  veo/fA^Ai. 

We  can  simplify  this  to  a  fairly  nice  expression  '  y  substituting  for  vec  <l(  A"  A  i  (tom  Eu  nation 
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4.60,  applying  Equations  4.51  through  4.53,  and  using  the  fact  that  Ki  p  —  Ip: 
du  =  (AffA)‘‘YT da  -  (uT  ®  (A^A)'1)  (lp  ®  AH)  vec(dA)  - 

(uT  ®  (AH A)-1)  KPiP  (lp  ®  At)  vec(dA) 

=  (AHA)-1YTda  -  (ur  ®  A*)  vec(dA)  -  KiiP  ((A^A)-1  ®  uT)  (Ip  ®  AT)  vec(dA) 

‘  In  ' 

=  (A^  A)-1YTda  —  [uiA^  ujA^  upA#  ]  ^m+1  T’da  — 

LUp-ixna  +  i  ^ 

-zr1. 

((A^A)"1  ®  urAr)  vec(dA) 

=  (AHA)-‘YTcfa-  A#  (UlI«  +  u2Z„  +  ••  •  +  upZpn~l)  „  Jm+1  T’ da  - 

[  Op—  lxm  +  l  ^ 

^(AWA)-1  ®  vec(dA) 

=  (AffA)~1YTda  —  A#JnU"T*da  —  vec  ^2r(dA’)[(AwA)_l]TJ 
=  (AwA)~lYT(ia-(AwA)"lNJm+IT’da  -  (Aw A)_1(dAff  )2, 

(4.84) 

where  N,  introduced  in  the  last  line,  is  a  Toeplitz  matrix  of  backward  residuals 

g  =  JnA'Jpu,  (4. 85) 

analogous  to  N  for  forward  residuals  £.  The  third  term  in  Equation  4.84  for  du  may  be  simplified 
further  by  the  fact  that 

{dAH)V  =  N  da.  (4.86) 

This  identity  is  analogous  to  the  commutativity  of  convolution,  and  to  the  switch  already  used 
between  AHu  and  Ya.  Equation  4.84  then  simplifies  to 

du  =  (AHA)-lYTda-(AHA)-1>JJm+lTda-iAHA)-]>irda 

=  (AHAf1  (YT  -  NJm+i T"  -  NT)  da  (4.87) 

=  (AhA)_1  (xT  -  NJm  +  ,T’)  da. 

This  is  the  differential  of  the  syndrome.  Now  we  can  express  the  vectorization  of  the 
signal  estimate  matrix  as 

I  x  m  lp  dq 
^  [  Op  x  m  lp  ‘  ^ 

vec  X  =  x.  \  !  .$$) 

-  Op  x  rr\  fr>  ; 
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Since  £  =  il  —  Ajt,  we  have 


0 


’pxm  Ap 


Ip  jin  1 


[Opxm  lp}Zn1 


vec  dX  =  - 

l  Op  x  m  Ip  j  In 

[  Op  x  m  Ip  j2n  1 


L[Opxm  ip ;znm 


[(dA)u  +  Adu 


[(«T  3>  In)  vec(dA)  +  Adu] 


L  Opxm  Ip]Z;m 
[Opxm  Ip  jin 
[.Opxm  Ip  ]Zn  1 


[Opxm  IplZn™ 
f  [  Opxm  Ip  jin 
[  Op  x  m  Ip  jZn  1 


0  [  Opxm  Ip  ]Zn  m  J 


- 1 

(HT  3  In) 

.z  r1. 

J  m  -fl 


Op—  1  xm+1 


T-  +  A*H[XT-NJm+1T‘]  \  da 


{jnU'T*  +  A#h[XT  -  NJm+,T-]j  da. 


(4.89) 


Substituting  these  results  into  Equation  4.82  yields 
d'Ua)  =  daT  2  Re|  THx\  AH  A)- 1  [XT  -  NJm+1T*]  - 


f  [  Op  X  m  Ip  ]  In 
[  Op  xm  Ip  ]  Zn  1 

b^OpXm  Ip  jZn  m  _J 


JnUT  +  A*t[x't-  -N'jm+iT]Hdft 


'4.90) 


The  effect  of  premultiplication  by  K?  m+1  is  to  reorder  the  rows  of  the  multiplied  matrix. 
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The  result  in  this  case  can  be  expressed  as 


r 


d2L(a)  =  daT2Re|T/fX//(A//A)"1[XT  -  NJm+1T*]  -  [ujT*  u2TH 

[Jm-fl  Om  +  lxp— l] 

[Om  +  lxl  dm  +  1  Om  +  lxp-2] 


UpT 


[Om  +  lxp  —  1  ] 


(jnUT  + A#t[X.'t'  -  N*Jm+1T])  Ida 


=  daT2  Re<j  T^X^jA*  A)-l[XT  -  NJm+jT']  - 


T*UHJn  (jnUT  +  A*[(a"a)-1]*[X*T*  -  N*Jm+1T])  Ida 


H  H 

=  daT2  Re<  TWX  (AwA)_lXT  -  T^X  (A*  A)-1  NJm  +  1T‘  - 

THU"UT  -  T//Jm+1NT[(AHA)-1]*X'T-  +  T^J^N^A^A)-1]*^' J^+.T  jda. 

(4.91) 

Because  of  the  real-part  operator,  we  can  take  the  conjugate  of  the  last  two  terms  and  then 
factor  the  expression  as 


d2L(a)  =  daT 2  Re  {  [t*X*  -  TrJm+1NW]  (AH A)~l  [xT  -  NJm  +  1T'j  -  TffUHUT}  da. 

(4.92) 


Let  S  denote  the  matrix 


S  =  A(AhA  )-1  XT  —  NJm+iT"] 

J 


(4.93) 


Then  the  final  expression  for  the  second  differential  is 


d2L(a)  =  dar2  Re  {SHS  -  ThVhTJT]  da.  (4.94) 


The  ‘second  identification  theorem  "  of  Magnus  and  Neudecker  allows  11s  to  identify  -  fie  Hessian 


from  Equation  4.94  as 


H'a;  =  V;SL ia)  =  2  Re  (SWS  -  T"v"UT}  4.95) 

The  filtering  interpretation  of  the  H-ssian  ;s  diown  m  Fisure  1  1  The  data  j  ••xe:».-s 

'fie  inverse  filter  ro  produce  *],.»  -vndrome  a  I  he  ^nororne  >  f-  rwar  1  iiei  on  A  ward 
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filtered  by  the  MA  filter  A  to  produce  the  forward  and  backward  residuals  £  and  ntwiddle. 
The  constraint  T  is  applied  to  the  data  and  to  the  residuals  £  and  £,  and  the  resuduals  are 
subtracted  from  the  data.  The  difference  vector  used  to  form  a  Toeplitz  matrix  and  filtered 
again  by  the  inverse  filter  A*  to  create  the  matrix  S.  In  the  other  branch  of  the  figure,  the 
syndrome  u  is  used  to  form  the  Toeplitz  matrix  U  to  which  the  constraint  T  is  applied.  The 
Hessian  is  then  formed  by  summing  the  Grammians  of  S  and  UT. 


Figure  4.4  Filtering  Interpretation  of  the  KiSS  Hessian. 


Newton  step.  We  'an  now  implement,  a  Newton  method  for  the  minimisation  of 
Li'Ji)  Far  any  -urfiit  aeration  a, .  the  Newton  step  ,s  to  subfact  the  inverse  the  Hessian 
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times  the  gradient: 

2i+i  =  a,  -  H(a,-)_1a(ai)-  (4-96) 

Performance  of  KiSS  and  Newton.  We  have  run  simulations  to  test  the  perfor¬ 
mance  of  the  KiSS  algorithm  with  and  without  a  Newton  method  as  a  second  phase.  We  have 
used  the  same  test  signal  for  these  tests  as  used  in  [KSS86]  and  others.  The  signal  is  n  =  25 
samples  of 

x(t)  =  e>Wlt  +e-'Vu'5‘,  (4.97) 


with 


=  2t(0.52), 


(4.98) 


=  2t(0.50). 

Noise  of  variance  <r2  is  added  to  each  of  the  real  and  imaginary  parts.  The  signal  to  noise  ratio 
(SNR)  is  defined  as 


SNR  =  101og10( 


(4.99) 


At  each  of  several  signal  to  noise  ratios,  we  ran  500  trials.  For  each  trial  the  KiSS 
algorithm  (phase  1)  was  run  on  the  data  with  a  new  realization  of  the  noise.  The  Newton-based 
phase  2  was  started  from  where  KiSS  terminated.  The  results  of  the  KiSS  algorithm  alone  were 
compared  with  the  results  after  phase  2.  Figure  4.5  shows  the  performance  as  — 10  logir)( MSE) 
versus  SNR.  You  cam  see  that  the  Newton  method  is  of  limited  value  when  it  follows  convergence 
of  KiSS,  improving  performance  above  the  threshold  SNR  cf  9  dB  only  slightly  and  actually 
degrading  the  already  bad  performance  below  the  threshold  SNR.  The  Newton  method  may  be 
more  useful  if  used  after  fewer  iterations  of  KiSS.  or  from  some  other  starting  point. 


4.3  KiSS  with  Structured  Noise 

We  now  consider  the  problem  of  estimating  structurally  -onstrair  >d  sianai  subspaces 
in  the  presence  of  structured  noise.  We  extend  the  KiSS  algorithm  to  account  for  s;  pictured 
noise  with  a  known  subspace  of  dimension  t  If  the  structured  noise  sunsna^  :s  not  known, 
but  corresponds  t..,  impulse  noise,  then  we  -.an  perform  a  search  over  -h-  p.--<siid..  structured 


74 


i 

30  H 

! 

25  ^ - * - ■ - * - 

0  5  10  15  20  25  30 

SNR 


Figure  4.5  Performance  of  KiSS  with  and  without  a  Newton  phase  2. 

noise  subspaces,  performing  the  extended  KiSS  algorithm  for  each  one  and  choosing  the  one 
that  minimizes  the  objective  function.  This  outer  optimizction  search  corresponds  exactly  to 
the  one  described  in  Section  3.4  for  the  same  problem  without  the  complex  exponential  signal 
model.  As  in  the  earlier  problem,  we  need  to  include  an  order  penalty  in  the  objective  function 
if  the  rank  t  is  unknown. 

Signal  model.  Let  the  signal  be  a  sum  of  complex  exponentials  as  in  Section  4.1: 

rrj 

x(t)  =  5Z«.^-  (4.100) 

1  =  1 

Suppose  the  observed  lata  consists  of  signal  at\.  plus  st.uctured  noise  bit),  plus  background 
.noise  ;/ if)  for  n  consecutive  time  indices 

.  n 


yi  t  .  =  ii  I  '  —  hi  I  I  —  i/i  I  i . 


t  =  1.2. 


i  ion 


75 


The  matrix-vector  formulation  is 


H  =  H  £  +  S&  +  u, 


(4.102) 


where  the  signal  is  x  =  H 9  with 


ec",  h  = 


■  21  2  2 

2  2 
Z1  22 


ecm.  (4  in?) 


The  structured  noise  is  b  =  S&  with 


ecn,  s  s  €n 


(4.104) 


The  background  noise  is  u  €  C". 

Objective  function.  Given  that  the  background  noise  is  zero-mean  white  Gaussian 
noise  we  could  derive  the  log-likelihood  function  and  maximize  it  to  obtain  the  ML  signal 
subspace  estimate.  But  this  time  we  will  take  the  equivalent  least  squares  approach.  The 
modeling  error  is  the  difference  between  the  observed  data  and  the  two  part  model  (signal  plus 


structured  noise): 


e  =u~  (H£+  Sfi) 


=  j-  HS 


(4.1C3) 


The  least  squares  objective  function  is 


2  -  „ 

e  =  e  e. 


(4.106) 


The  last  form  of  Equation  4.105  leads  to  the  following  form  of  the  objective  function  when 
optimum  values  of  £  and  2  have  been  chosen  (more  detail  is  given  in  the  chapter  on  parameter 
estimation]: 

e2  =  PHs.'il,  '4.107) 


PHS  =  H  Si  ('H  S]H  H  5; )  '  H  S]1 


Derivation  of  a  KiSS  type  algorithm.  A  key  ep  :n  'lie  derivation  of  *i)e  KiSS 
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algorithm  was  the  switch  from  H  to  the  Toeplitz  matrix  A  spanning  the  orthogonal  complement 
of  the  signal  subspace.  By  rewriting  the  objective  function  of  Equation  4.107  we  will  eventually 


be  able  to  make  the  same  switch. 

e2  =  uHu  -  UH  [H  S]  ([H  S]H  [H  S])  '  [H  S]"  a 

_a  u  SJ  [SWH  SHSJ  SHi 

The  inverse  can  be  found  by  the  block  matrix  inversion  formula  [Kai80]: 


(4.109) 


'hhh  hhs1_1_ 

ShH  shs 

’(H"H)_l  +  H#S  [SHS  -  S^PhS]-1  Sh(H*)h  -H#S  [ShS  -  ShPhS]_1 

-  [sws  —  swpHs] _1  Sh(H#)h  [shs-shphs]_1 


so 

e2  =  uHu- u.HPhU-  ti"PHS  [sff(I  -  PH)S]_1  S"pHll  +  a"PHS  [s"(I  -  PH)S]  SHu  + 


uH S  [s"(I  -  P„)S]  1  shphu  -  u"s  [S*(I  -  Ph)S]  s Hu 


(4.111) 


Now  make  the  switch  by  replacing  all  occurrences  of  PH  with  I  -  PA. 
e2  =  -  ji*(I  -  P A)a  -uH( I  -  Pa)s  [s^p as]  -I  s"(I  -  PA)a  + 

uH(i  -  pa)s  [stfpAsr 1  s Hu  +  UHS  [swpas]_1  s"(i  -  pA)a  -  uH s  [s^PaS]"1  shu 

=  UHPaU-UHP*S  [SwPAS]_1SHPA-i. 

(-1.112) 

Then  make  the  substitution  PA  =  A(AH  A)~l  AH  and  factor  as 


e2^uHA  {AHA)-l-(AHA)-1AHS[SHPAS}'iSHA(AHA)-iAHu-  (4.113) 


Finally,  taking  advantage  of  the  identity  A1*  n  =  Ya  we  can  write  the  objective  function 


as 

e2  =  a"  Yh  \(AhA)-1  -  (AffA)'1AffS  (S^PaS)"1  SwA(  Ah  A)-1 !  Ya.  (4.114) 

t  -f 


The  form  of  Equation  4.114  is  similar  to  the  form  of  the  KiSS  objective  function 
in  Equation  4.15.  The  difference  is  the  "correction  term"  subtracted  from  (A^Ai-1  which 
accounts  for  the  structured  noise  poiuon  of  the  model  The  new  objective  function  is.  in  fact. 


77 


invariant  to  all  structured  noise  in  the  subspace  (S). 

We  assume  that  S  is  fixed  (either  known  or  selected  as  a  candidate  in  an  outer  opti¬ 
mization)  and  try  to  minimize  Equation  4.114  with  respect  to  a.  We  propose  an  algorithm  to 
minimize  e2  that  corresponds  to  the  KiSS  algorithm:  Hold  the  expression  in  square  brackets  in 
Equation  4.114  fixed  at  each  iteration  and  solve  the  resulting  quadratic  minimization  problem 
for  the  new  a,  then  use  that  value  of  a  to  update  the  expression  in  square  brackets  for  the 
next  iteration.  Before  the  first  iteration  the  expression  in  square  brackets  is  set  to  I  —  Ps. 
Constraints  on  a  may  be  handled  in  the  same  way  as  for  the  KiSS  algorithm. 

The  KiSS  algorithm  with  structured  noise.  With  the  affine  constraint  of  Equa¬ 
tion  416,  the  objective  function  of  Equation  4.114  becomes 

e2  =  (cH  +  aHTH)YH  [(A*  A)-1  -  (A*  A)-1  AWS  (S*PaS)  SH A(AH  A)-1]  Y(To  +  c). 

(4.115) 

The  only  difference  from  the  KiSS  algorithm  without  structured  noise  is  the  matrix  Qt  that  we 
fix  for  each  iteration: 

Q,  =  Yh  [(AhA)“1  -  (AHA)'‘AffS  (S*PAS)-1  SWA(AHA)_1]  Y.  (4.H6) 

The  explicit  steps  of  the  constrained  KiSS  algorithm  with  structured  noise  are 

1.  Build  appropriate  Y,  T,c. 

2.  Set  Q  =  Y*(I  -  PS)Y. 

3.  Repeat  until  convergence: 

3a.  Let.  o  =  -[Re(TwQT)]“ 1  Re(TffQc). 

3b.  Let  a  =  To.  -r  c. 

3c.  Build  A  from  a. 

3d.  Let  Q  =  Yh  fA^A)*1  -  (A^Aj-'A^S  (StfPAS)~*  SwAi  AH  A)-1 j  Y. 

4.  Stop. 

Convergence  is  nested  in  the  same  way  as  the  original  KiSS  algorithm. 

Example:  Second  order  KiSS  ojective  function  with  structured  noise.  To 
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illustrate  the  difference  between  the  original  KiSS  objective  function  and  this  one  which  accounts 
for  structured  noise,  we  return  to  the  second  order  example  of  Figure  4.1  and  Figure  4.2.  When 
substantial  structured  noise  is  added  to  the  signal,  the  surface  is  greatly  distorted  as  shown  in 
Figure  4.6.  However,  when  the  modified  KiSS  objective  function  is  used,  accounting  for  the 
structured  noise,  the  surface  returns  to  its  original  characteristic  shape  as  shown  in  Figure  4.7. 


4 


Figure  4.6  KiSS  objective  function  corrupted  by  structured  noise 

Simulation  Results. 

Simulations  of  the  KiSS  algorithm  modified  for  structured  noise  have  not  been  ex¬ 
tensive  because  of  the  high  computational  cost  involved.  Enough  simulations  have  been  run. 
however,  to  make  the  following  observations.  The  algorithm  converged  in  all  the  trials  iover 
30.000)  whether  the  structured  noise  matrix  was  '‘orrect  or  not.  When  given  th*>  corr"'"r  struc- 
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Figure  4.7  KiSS  objective  function  with  structured  noise  accounted  for. 

tured  noise  matrix,  the  frequency  estimation  performance  (measured  by  mean  squared  error 
in  fi  or  fn)  of  the  algorithm  was  essentially  the  same  as  the  regular  KiSS  algorithm  in  the 
absence  of  structured  noise.  In  contrast,  the  use  of  the  regular  KiSS  algorithm  when  structured 
noise  was  present  resulted  in  very  poor  estimator  performance,  as  one  would  expect  considering 
Figure  4.6. 

The  main  set  of  tests  was  conducted  as  follows.  The  test  signal  was  the  same  as  the  one 
we  used  for  the  KiSS  algorithm,  with  25  samples  of  two  complex  exponentials.  The  background 
noise  was  added  in  the  same  way  as  for  the  regular  KiSS  tests,  and  the  SNR  computed  the 
same  way.  We  added  a  fixed  structured  noise  vector  to  the  signal  and  background  noise,  with 
impulses  in  positions  2  and  23  whose  values  were  — 5-*-7 j  and  4  —  3 j  respectively.  Leaving  out  the 
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order  selection  aspect  of  the  problem,  we  assumed  knowledge  of  t  =  2  impulses  and  conducted  a 
combinatorial  search  over  the  ("35)  possible  locations  for  the  pair  of  impulses.  For  each  candidate 
S  we  applied  the  structured  noise  version  of  the  KiSS  algorithm  until  it  converged.  We  then 
chose  as  our  estimate  of  S  the  candidate  for  which  the  smallest  value  of  the  objective  function 
had  been  obtained,  and  took  the  corresponding  estimate  of  the  AR  parameters  a. 

To  be  blunt,  this  method  of  identifying  the  structured  noise  subspace  did  not  work. 
The  correct  subspace  was  chosen  zero  times  out  of  50  background  noise  realizations  at  20 
dB  SNR,  and  zero  times  out  of  50  even  with  very  clean  data  at  50  dB  SNR.  The  frequency 
estimation  performance  was  also  very  bad,  since  the  structured  noise  was  not  successfully 
removed  from  the  data.  We  believe  the  algorithm  was  coded  correctly,  although  a  program  bug 
is  not  impossible  as  the  explanation.  Oi.r  conclusion  is  therefore  that  this  is  not  a  good  way  to 
identify  the  structured  noise  subspace.  We  observe  that  a  better  way  to  identify  the  structured 
noise  subspace  might  be  found  by  comparing  Figures  4.6  and  4.7.  From  these  figures  it  appears 
that  the  error  surface  is  “smoother’’  when  the  structured  noise  has  been  correctly  accounted 
for.  If  this  observation  could  be  quantified  it  might  lead  to  a  better  identification  technique 
for  simultaneously  identifying  complex  exponentials  and  impulse  noise.  We  leave  this  idea  as  a 
suggested  extension  of  this  work. 


CHAPTER  V 


Order  Selection  for  Subspace  Identification 

In  this  chapter  we  consider  the  problem  of  selecting  the  appropriate  dimensionality  of 
a  subspace.  We  consider  two  kinds  of  subspace  order  selection  problems.  In  the  first,  there  is  a 
known  prior  subspace  model,  but  it  may  be  beneficial  to  reduce  the  rank  of  that  model  because 
of  the  presence  of  noise.  This  is  a  typical  rank  reduction  problem,  where  we  trade  off  model 
bias  and  variance  to  minimize  mean  squared  error.  In  the  other  order  selection  problem,  we 
have  no  known  prior  subspace  model.  Instead  we  are  trying  to  identify  the  subspace  from  the 
data  as  in  Chapters  III  and  IV.  It  is  not  clear  that  the  bias  versus  variance  tradeoff  applies, 
since  we  have  no  model  against  which  to  measure  bias. 

5.1  Rank  Reduction  of  a  Prior  Signal  Subspace  Model 

The  purpose  of  rank  reduction  is  to  reduce  the  Mean  Squared  Error  (MSE)  of  an 
estimator.  We  measure  the  error  in  the  observation  space,  noting  that  it  could  also  be  measured 
in  the  parameter  space.  Since  the  actual  Squared  Error  (SE)  is  unknown,  one  class  of  order 
selection  rules  involves  making  an  estimate  of  the  SE  at  each  rank,  and  choosing  the  rank  for 
which  this  estimate  is  smallest.  The  order  selection  rule  of  [ScS86]  is  of  this  type,  as  is  our  new 
approach.  Other  methods  of  order  selection  in  similar  situations  are  given  in  [Aka74],  [SMB87], 
[Mar87],  [Kum85]. 

We  begin  with  the  linear  statistical  model  of  complex  data  >/, 

U  =  j  4-  v 


r  i 

u 

— 

H 

2 

4- 

1 y 

n  n  *  m  m  n 
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In  this  equation,  H  is  a  known  system  matrix  which  spans  the  m-dimensional  signal  subspace 
(m  <  n),  £  is  the  parameter  vector  and  x  =  H£  is  the  signal.  The  additivv  noi°e,  u,  is  zero 


mean,  white,  and  Gaussian,  with  independent  real  and  imaginary  parts: 

ReQi):  N(G,i<72I) 

\  (5.2) 

Im(£)  :  N(Q,  -*2I). 

At  the  receiver  both  the  signal  x  and  the  parameter  £  are  unknown,  and  a  signal 


estimate  is  desired.  The  maximum  likelihood  estimate  of  x  is  the  least  squares  solution  obtained 


by  projecting  tj  onto  the  signal  subspace.  We  shall  denote  this  signal  estimate  by  since  it 
lies  in  the  m-dimensional  signal  subspace: 


=  PhU,  (5.3) 

where 

PH  =  H(H^H)-iH//.  (5.4) 

The  estimator  x^  is  unbiased,  and  has  variance  mer2.  A  reduced  rank  estimator  is 
formed  when  we  replace  PH  in  Equation  -5.3  with  a  lower  rank  projection  Pr  (r  <  m),  whose 
range  is  contained  in  the  range  of  PH: 


£r  =  PrU- 


Using  Pr  reduces  the  variance  of  the  estimate  from  mer2  to  rcr2 ,  but  introduces  bias  £.  given 
by 

6r  =  (I-Pr)r  =  (P„-Pr)r.  (5.6) 

Figure  51  shows  how  the  error  can  be  decomposed  into  two  orthogonal  components,  one  due 
to  noise  variance  and  the  other  due  to  the  bias  introduced  by  rank  reduction  Because  of 
orthogonality,  the  squared  error  can  be  estimated  as  the  sum  of  'he  variance  and  'he  squared 
magnitude  of  the  bias.  Since  the  variance  is  known,  we  need  an  estimate  of  the  squared 
magnitude  of  the  bias,  b2  —  b?br.  In  the  following  pages  we  review  several  estimators  of  hi  and 
introduce  a  new  estimator  based  on  two  sequential  applications  of  the  principle  of  maximum 
likelihood. 
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Figure  5.1  Orthogonal  decomposition  of  error. 

Once  we  have  estimated  the  squared  bias,  we  choose  the  rank  for  which  our  estimate 
of  the  squared  bias  plus  the  variance  is  minimum.  But  there  is  more  tc  choosing  the  low  rank 
projection  Pr  than  the  choice  of  rank.  We  must  also  choose  the  orientation  of  the  subspac-;. 
For  this  we  return  to  the  model  matrix,  H,  and  apply  the  singular  value  decomposition.  The 
left  singular  vectors  of  II  orthogonally  span  the  signal  subspace,  and  a  subset  of  the  singular 
vectors  is  chosen  to  form  the  range  of  Pr.  The  subset  may  be  chosen  by  taking  the  singular 
vectors  corresponding  to  the  largest  singular  values,  or  by  a  data  dependent  method  such  as  that 
suggested  in  .KTS84],  and  [SchOl].  In  most  of  our  simulations  -ve  have  used  a  Jma  dependent 
selection  method. 

Estimators  of  Squared  Bias.  We  now  consider  how  he  squared  bias  may  be 
estimated  from  the  data  for  a  given  low  rank  projection  P„.  Recall  dint  die  squared  bias  is 

h;  =  !d%  =  x"(i  -  p-  ■£  =  p„  -  p-  ■£ 


defined  as 
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Of  course,  we  have  destroyed  the  unbiasedness  in  the  modification.  For  a  given  rank,  each  of  the 
estimators  of  bias  squared  may  be  plotted  as  a  function  of  the  maximum  likelihood  estimator 
as  shown  in  Figure  5.2  for  (m  —  r)  =  5. 
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Estimators  of  Squared  Bias 


0  3  4  6  8  10  12  14  16  18  20 


ML  estimator 


Figure  5.2  Four  estimators  of  squared  bias. 


In  the  remainder  of  this  section  we  develop  our  primary  result,  a  new  estimator  of 
squared  bias  which  will  be  called  estimator  #4.  Consider  the  normalized  maximum  likelihood 
estimator  of  squared  bias: 


*2  = 


(5.12) 


The  distribution  of  i2  is  noncentral  chi-squared  with  m  —  r  degrees  of  freedom  and  noncentrality 
parameter  A  =  ^r62.  Thus,  the  estimation  of  6;  may  be  posed  as  the  fundamental  problem 
of  estimating  the  noncentrality  parameter  in  a  noncentral  chi-squared  distribution  with  known 
degrees  of  freedom,  based  on  a  single  observation.  A  similar  estimation  problem  was  considered 


by  Mever  [Mey67],  but  with  only  two  degrees  of  freedom  and  with  multiple  observations. 

We  obtain  the  maximum  likelihood  estimator  b ;  for  our  case  by  maximizing  rhe  like- 
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lihood  function: 


*r,# 4  =  arS^  Xm- r(x2;\). 


In  this  Equation,  Xm-r(t!;  A)  is  the  noncentral  chi-squared  density  function  of  x2  with  m— r  =  d 
degrees  of  freedom  and  noncentrality  parameter  A.  It  is  given  in  [Lan69]  as 

2^ r(|)  ;_1 


Setting  the  derivative  of  Equation  5.14  with  respect  to  A  equal  to  zero,  we  obtain  the 
maximum  likelihood  equation 

«-4(-^-V)*(<*-2)  e-4e-i(x2)^-2)  ^  (4yjA>-> 

Z-  j-i 


2tr(f) 


n^+2*) 


2*r(?) 


El  2  /  J A 
j- 1 

J  =  1  j![](^  +  2fc) 


For  x2  ^  0  we  may  multiply  Equation  5.15  through  by 

2*r(g) 

e'^e'  >(x2)i(<<~2) 


and  simplify  to 


i  ^  (4yj»~l  -  k(4y a> 

+  - I^Z_  =  0.  (5.16) 

j=l  J-  jj  +  2fc) 

k=0 

The  result  is  a  power  series  in  A,  and  may  be  approximately  solved  by  any  of  several  numerical 
approaches.  The  power  series  may  be  made  more  explicit  by  rewriting  Equation  5.16  as 


X>;A;=°. 

;=0 


{4y(rJ^2l) 


jlf[(d  +  2k) 


If  Equation  5.17  has  a  root  for  positive  real  A,  that  root  is  the  ML  estimator  of  the 
noncentrality  parameter.  If  the  largest  real  root  is  negative,  the  ML  estimator  of  noncentralitv 
is  zero,  since  the  maximization  of  Equation  5.13  is  restricted  to  positive  A. 
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There  are  some  theoretical  curiosities  about  the  new  estimator  of  squared  bias  (#4). 
It  was  obtained  by  two  sequential  applications  of  the  principle  of  maximum  likelihood.  First, 
through  invariance,  estimator  #1  was  found  to  be  the  ML  estimator  of  squared  bias.  Then  this 
first  estimator  was  considered  as  a  statistic  and  its  distribution  was  known  to  be  chi-squared 
with  noncentrality  parameter  equal  to  the  squared  bias.  So  ML  was  applied  again,  based  on 
the  chi-squared  statistic,  t.o  obtain  our  new  estimator.  One  can  envision  tms  process  t>eing 
extended  to  further  iterations  of  the  ML  principle,  although  the  complexity  of  the  likelihood 
functions  increases  rapidly,  so  that  even  one  more  iteration  appears  intractable. 

Also  of  interest  is  the  fact  that  the  first  ML  estimator  of  squared  bias  is  not  a  sufficient 
statistic  for  squared  bias,  as  can  be  seen  by  application  of  the  Fisher-Neyman  factorization 
theorem  [Sch91].  This  fact  tends  to  undermine  the  credibility  of  the  new  estimator.  Even  so, 
the  simulations  show  that  it  outperforms  the  other  three  estimators  at  low  signal-to-noise  ratio. 
At  high  signal-to-noise  ratio  all  four  estimators  perform  about  the  same,  and  high  rank  models 
are  generally  preferred. 

Simulations.  We  have  performed  simulations  using  MATLAB™  to  test  and  com¬ 
pare  estimators  #1  through  #4  and  the  corresponding  order  selection  rules  on  real- valued  data. 
Equation  5.17  is  solved  numerically  by  truncating  the  series  and  rooting  the  resulting  polyno¬ 
mial.  The  typical  pattern  of  root  locations  from  this  process  is  shown  in  Figure  5.3,  where  we 
are  interested  in  the  one  positive  real  root.  To  save  time  during  actual  simulations,  we  precom¬ 
pute  the  solutions  to  Equation  5.17  for  a  range  of  the  statistic  x 2  from  .1  to  50  in  increments 
of  .1,  and  for  1  to  10  degrees  of  freedom.  The  ML  function  is  then  evaluated  by  table  look  up 
with  linear  interpolation.  For  larger  values  of  r2,  which  occur  at  high  signal-to-noise  ratio,  the 
ML  function  is  nearly  linear,  and  we  use  a  linear  approximation. 

We  have  run  four  sets  of  tests  at  5  values  of  SNR  for  each  set.  In  each  set  200 
realizations  of  the  noise  vector  have  been  simulated  at  each  SNR.  In  the  first  set  the  parameter 
vector  is  also  chosen  randomly  at  each  realization,  while  the  parameter  is  fixed  in  the  other 
three  sets  of  tests.  All  four  bias  estimators  are  applied  to  each  realization  and  tne  sample 
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mean  squared  error  of  the  resulting  signal  estimators  is  observed  along  with  the  order  selection 
behavior  of  each  rule. 
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Figure  5.3  Roots  of  truncated  power  series. 


For  the  first  set  of  tests,  we  let  n  =  10  and  m  =  6.  For  each  signal-to-noise  ratio  we 
fix  a  randomly  generated  system  matrix  H  and  run  200  trials.  In  each  trial  a  new  parameter  £ 
and  a  new  noise  vector  are  generated  according  to 

£:  mi). 


v  :  .V(0,  cr2I). 


(5.19) 


The  signal-to-noise  ratio  used  is  the  expected  per-sample  SNR,  calculated  as 
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The  actual  squared  error  (SE)  for  a  typical  realization  at  0  dB  is  plotted  versus  rank  in  Figure 
5  4.  Also  shown  are  estimators  #2  (labeled  CM)  and  #4  (labeled  IML).  In  the  realization 
shown,  both  rules  selected  rank  4  while  the  best  choice  was  rank  6. 

Figure  5.5  shows  a  plot  versus  SNR  of  the  observed  sample  mean  squared  error  (MSE) 
between  the  true  signal  s.  and  the  low  rank  estimator  £r  chosen  by  each  of  the  proposed  order 
selection  rules.  The  Oracle  curve  gives  the  sample  MSE  of  a  hypothetical  order  selection  rule 
that  always  makes  the  best  choice  to  minimize  the  true  error  and  thus  represents  the  best 
possible  performance.  Success  as  an  order  selection  rule  is  plotted  versus  SNR  in  Figure  5.6, 
where  the  measure  of  success  is  the  number  of  trials  out  of  200  where  the  rule  made  the  right 
choice  of  rank,  in  agreement  with  the  oracle.  The  histogram  in  Figure  5.7  gives  a  more  complete 
characterization  of  the  order  selection  behavior  of  rule  #4  at  0  dB.  For  example,  the  bar  above 
I  indicates  how  often  the  rule  overestimated  the  best  rank  by  1. 

In  the  second  set  of  tests,  we  have  selected  for  H  a  10  x  3  matrix  representing  three 
sinusoids  with  relative  radian  frequencies  ff ,  ^ ,  and  .  The  singular  values  of  this  system 
matrix  are  approximately  2.9,  2.2  and  1.0.  The  parameter  is  fixed  for  all  trials  at  £=  [1  I  l]r. 
The  results  for  this  set  of  tests  are  given  in  Figures  5.5  and  5.9,  where  SNR  is  here  defined  as 

SNR  =  10  log10(  =— 4).  (5.21) 

TUT- 

The  third  set  of  tests  differ  from  the  second  only  in  the  new  choice  of  £  =  [0  1  l]T. 
Thus  for  these  tests  the  signal  lies  in  a  rank  2  subspace  in  terms  of  H,  but  not  in  terms  of  the 
left  singular  vectors  of  H.  Results  for  this  test  are  given  in  Figures  5.10  and  5.11. 

For  the  fourth  set  of  tests  the  same  system  matrix  is  used  as  for  the  second  and  third 
sets.  This  time  the  parameter  vector  has  been  chosen  to  result  in  a  signal  that  is  very  nearly 
rank  2  in  terms  of  the  left  singular  vectors  of  H,  namely  £  =  [0.3  1  1]T.  Also  different  in  this  set 
of  tests  is  the  method  used  to  choose  the  priority  of  the  singular  vectors.  The  first  three  tests 
use  the  data  dependent  method  described  in  [Sch91]  of  choosing  the  singular  vectors  whose 
inner  product  with  u  is  largest,  indicating  the  strongest  modes  for  each  trial.  In  this  test,  the 
singular  values  of  H  are  used  to  determine  the  dominant  singular  vectors.  Figures  5  12  and 
5-13  show  the  results  for  the  fourth  set  of  tests. 
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Figure  5.5  Sample  MSE  versus  SNR. 
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Figure  5.9  Order  selection  success. 
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Figure  5.12  Sample  MSE  versus  SNR. 
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Conclusions.  As  expected,  relatively  clean  data  is  best  modeled  by  the  full  rank 
system  matrix.  The  tests  generally  show  that  rank  reduction  becomes  useful  only  for  relatively 
noisy  data.  On  the  other  hand,  if  the  data  is  too  noisy  the  best  choice  of  rank  is  always  0. 
For  the  range  of  SNR  over  which  rank  reduction  is  useful,  the  new  order  selection  rule  (#4) 
performs  generally  better  than  the  others  under  the  conditions  of  these  tests. 

The  iterated  ML  approach  to  bias  estimation  leads  to  a  nonlinear  optimization  prob¬ 
lem.  As  such  it  requires  much  greater  computational  effort  than  the  linear  bias  estimator 
suggested  in  [ScS86].  However,  most  of  the  extra  computation  can  be  done  in  advance  and  our 
experiments  suggest  that  the  extra  computational  effor,  squired  by  the  iterated  ML  estima¬ 
tor  does  improve  performance.  The  concept  of  iterating  the  principle  of  maximum  likelihood 
remains  a  curiosity. 
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5.2  Order  selection  in  Subspace  Identification 

We  now  consider  the  order  selection  problem  of  identifying  the  number  of  impulses 
making  up  a  structured  noise  subspace,  assuming  that  the  signal  subspace  (H)  is  known.  In 
previous  work  [SMB87]  we  have  successfully  applied  the  order  selection  rule  of  Scharf  and  Storey 
[ScS86]  to  this  problem.  However,  the  derivation  of  that  order  selection  rule  is  based  on  the 
bias  versus  variance  tradeoff  that  occurs  in  the  rank  reduction  problem.  It  is  not  clear  that  it 
should  be  applied  to  order  selection  for  the  subspace  identification  problem. 

An  alternative  way  to  proceed  is  to  use  a  Bayesian  hypothesis  test  on  t,  the  number 
of  impulses  presumed  to  exist  in  the  observed  data  vector: 

H0:  t  =  0 


Hi  ■■ 


(5.22) 


Hq  t  —  q. 

The  maximum  number  of  impulses  we  can  successfully  correct  is  q  —  n  —  m  where  n  is  the  size 
of  the  data  vector  and  m  is  the  rank  of  the  signal  subspace. 

The  signal  model  and  assumptions  are  as  follows.  The  signal  subspace  H  is  known. 
The  data  consists  of  signal  plus  structured  noise  plus  white  Gaussian  background  noise: 


U  =  H  £+  S£+  u. 


The  structured  noise  is  impulsive  in  nature,  corresponding  to  a  structured  noise  matrix  S  whose 
columns  are  an  unknown  subset  of  the  columns  of  the  identity  matrix.  The  parameters  9  and  £ 
are  unknown.  We  consider  first  the  case  where  ail  variables  are  real  valued,  so  the  background 
noise  is  distributed  as 

k.  :  .V(Q,<t2I),  (5.23) 

where  we  will  consider  both  the  case  where  <r2  is  known  and  the  case  where  it  is  not  known. 

To  derive  the  Bayes  test  on  t  outlined  earlier,  we  need  to  know  the  joint  probability 
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density  function  of  the  data  u  and  the  number  of  impulses  t.  This  can  be  expressed  as 

=  4is,<(a|S,f)/s|<(S|<)/<(0-  (5.24) 

The  conditional  density  of  y.  given  S  and  t  is  a  joint  normal  inherited  from  the  noise  density: 

4ls.«(a|S,<)  =  (2^2)"n/2exp  (-^(a-H£-S4)T(a-H£-S£))  .  (5.25) 

To  get  suitable  density  functions  for  S  and  t  let  us  assume  that  impulse  errors  affect 
the  data  samples  independently,  and  each  sample  is  affected  with  probability  p.  This  implies 
that  the  number  of  impulses  obeys  a  binomial  distribution,  while  all  combinations  of  t  impulses 
occur  with  the  same  probability.  In  this  case  the  probability  mass  function  for  t  is 

f,(i)=  (")p*(l (5.26) 

The  conditional  probability  mass  function  for  S  given  t  is  uniform: 

/s|«(S|0=^y.  (5.27) 

To  implement  the  Bayes  test  for  t  we  must  choose  t  and  the  unknown  parameters 
S,£,  and  possibly  <r 2  to  maximize  the  joint  density  of  y  and  t.  This  is  of  course  equivalent 
to  minimizing  the  negative  log  of  the  joint  density  function,  so  we  now  write  our  objective  as 

g  minj(  —  H£  —  S&)T(y  —  H£  —  S&)  +  ^  ln(2T<r2)  —  In  (p!(l  —  p)n-t)  .  (5.28) 

Minimization  of  this  objective  with  respect  to  9  and  £  may  be  accomplished  by  substituting 

the  ML  estimates  of  these  parameters  as  derived  in  Chapter  VI.  resulting  in  the  objective: 

mi»  2^T(I—  pHs)a+  ^  ln^xo-2)  -  In  (p'(l  -  p)n~‘)  .  (5.29) 

We  now  proceed  to  evaluate  this  objective  function  for  two  cases  depending  on  whether  or  not 
<7~  is  known. 

Known  Noise  Variance.  When  cr-  is  known  the  second  term  in  Equation  5.20  is 
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constant  and  may  be  dropped,  resulting  in  the  objective: 

min  -  PHs)a- In  (p'(l -P)n~')  .  (5.30) 

Since  t  is  determined  for  each  S,  this  objective  may  be  viewed  as  a  criterion  for  choosing  S.  For 
each  of  the  finite  number  of  possible  selection  matrices  S  (and  its  corresponding  t),  the  function 
in  Equation  5.30  is  computed.  The  Bayes  hypothesis  test  for  order  selection  with  known  noise 
variance  is  to  choose  the  S  for  which  the  computed  value  of  Equation  5.30  is  smallest. 

This  test  may  be  compared  to  the  order  selection  rule  used  in  [SMB87]  in  which  the 
objective  function  may  be  expressed  as: 

min  -4^(1-  PHs)u+(2t  -  m)  .  (5.31) 

In  both  rules  the  first  term  is  data  dependent  while  the  second  may  be  viewed  as  an  order 
selection  penalty  function  favoring  some  values  of  t  over  others.  Figure  5.14  shows  a  graph 
comparing  the  penalty  functions  for  n  =  10,  with  several  values  of  the  probability  of  impulse  p. 
The  rule  of  [SMB87]  is  linear  in  t.  In  the  figure  equal  slopes  imply  equivalent  order  selection 
rules.  You  can  see  that  for  n  =  10  the  linear  rule  is  most  like  a  Bayes  rule  with  p  about  equal 
to  0.1.  The  figure  also  shows  that  the  Bayes  rule  gives  greatest  favor  (least  penalty)  to  rank 
t  =  np,  the  expected  value  of  the  number  of  impulses  present. 

Unknown  Noise  Variance.  When  a 2  is  not  known  we  substitute  the  ML  estimate 

into  Equation  5.29  as  we  did  for  the  parameters  9  and  Minimizing  Equation  5.29  with  respect 

to  <r2  gives  the  following  ML  estimate  of  <r2: 

d2=  -/(I  —  Phs  )li-  (5.32) 

n 

Substituting  this  into  the  objective  function  gives 

min  ^  ^  In  f—  uT(l  -  Phs)^)  -  In  (p*(l  -  p)n~‘)  .  (5.33) 

s.t  [2  2  \  n  1 

The  first  term  is  constant  and  may  be  dropped  from  the  minimization.  We  may  obtain  an 
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Order  selection  penalties 


Order  (number  of  impulses) 
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The  matrix  H  and  the  parameters  £  and  £  may  also  be  complex,  but  S  is  still  a  selection 
matrix.  The  only  differences  this  introduces  into  the  objective  function  for  order  selection  is 
the  replacement  of  the  transpose  by  the  Hermitian  transpose  and  the  replacement  of  n/ 2  by  n 
in  the  normalization  of  the  Gaussian  density. 

For  cr~  known  the  objective  function  for  complex  data  is 

min  -  PHS)a- In  (p'(l  -  p)n_t)  .  (5.36) 

S,i 

The  appropriate  form  of  Equation  5.31  for  complex  data  is  the  order  selection  rule  from 
[SMB87]: 

min  -  Phs)h+  (2<  -  m)  .  (5.37) 

For  a2  unknown  the  objective  function  for  complex  data  is 

min  [n  In  P hs)u)  -  In  (p'(l  -  p)n~')]  ,  (5.38) 


and  the  alternate  expression  for  this  objective  function  is 


min 
s  ,t 


'(■k"(i-pHS)ar 
.  p‘(i-p)n_‘ 


(5.39) 


Simulation  results.  The  order  selection  rules  presented  above  were  tested  on  sim¬ 
ulated  complex  data.  There  are  three  rules  to  be  compared,  corresponding  to  Equations  5.38, 
5.37  and  5.36.  The  three  were  tested  in  parallel  on  the  same  data. 

We  chose  n  =  10  samples  of  the  same  complex  exponential  signal  used  to  test  the  KiSS 
algorithm.  We  then  added  structured  noise  according  to  the  model,  with  each  vector  element 
having  probability  p  =  0.1  of  an  impulse.  The  impulse  amplitudes  were  chosen  with  a  fixed 
magnitude  of  5  and  uniform  random  phase  in  the  complex  plane.  Background  noise  was  added 
to  the  desired  SNR.  (defined  the  same  way  as  for  KiSS  tests  in  Chapter  IV). 

Table  5.1  shows  order  selection  performance  of  the  three  rules  at  signal-r.o-noise  ratios 
from  10  to  40  dB.  The  figures  in  the  table  are  the  number  of  times  out  of  50  trials  that  the 
correct  S  was  chosen  by  the  rule  in  question.  The  performance  of  the  Bayes  rule  is  the  best  in 
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the  table,  although  you  can  see  that  the  Bayes  rule  performs  about  the  same  as  the  Storey  and 
Scharf  rule.  With  unknown  noise  variance,  the  Bayes  rule  consistently  chose  a  rank  to  high, 
and  never  got  it  right  at  any  of  the  tested  SNR’s.  Perhaps  this  could  be  improved  by  adjusting 
the  normalization  in  the  estimate  of  the  noise  variance  <r2 ,  as  discussed  in  Section  3.2. 

Table  5.1  Order  selection  performance  of  three  rules. 


SNR  (dB) 

0 

10 

20 

30 

40 

Bayes  Rule 

19 

14 

14 

13 

14 

Storey/Scharf  Rule 

12 

10 

14 

12 

9 

Bayes  Rule  (unknown  variance) 

0 

0 

0 

0 

0 

CHAPTER  VI 


Parameter  Estimation  in  the  Structured  Noise  Model 

Recall  the  structured  noise  model  introduced  in  Chapter  I: 

It  =  x  +  £  +J ' 


‘ 

u 

= 

H 

£ 

+ 

S 

2 

4- 

U_ 
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.  _ 
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n 

L  - 

n  x  k 

k 

n  x  t 

t 

n 

The  goal  in  this  chapter  is  to  estimate  the  signal  component  r  or  the  parameter  £  based  on 
the  received  data  ij.  The  model  matrices  H  and  S  are  assumed  to  be  known,  or  previously 
estimated  as  in  Chapters  III  and  IV. 

We  consider  three  cases  of  this  estimation  problem:  (1)  We  assume  that  £  and  £  are 
real  or  complex  unknown  parameters  and  the  estimates  £  and  £  are  to  be  chosen  to  minimize  the 
squared  norm  of  the  fitting  error  between  (H£  +  S£)  and  u-  In  the  real  case,  this  is  equivalent 
to  placing  a  white  Gaussian  distribution  on  v  and  finding  the  ML  estimators.  (2)  We  assume 
♦hat.  the  noise  vectors  u  and  £  are  real  random  vectors  drawn  from  Gaussian  distributions  with 
known  autocorrelations.  (3)  In  addition  to  the  distributions  of  case  2,  We  assume  that  the 
parameter  vector  £  is  a  real  random  vector  drawn  from  a  Gaussian  distribution  with  known 
autocorrelation.  These  three  cases  may  be  summarized  according  to  the  density  each  one 
implies  for  to  the  data  u.: 

(1)  a:.V(H£  +  Sa,<7;I), 

(2)  a  :  .V(H£,<7;I  + SR„«Sr). 

(3)  u  .VU^I  +  SR^Sr*HR«Hri 


i  ri .  2 ) 


108 


In  the  least  squares  (LS)  section  (case  1),  £  and  2  are  assumed  to  be  real  or  complex 
unknown  parameters  and  the  estimates  £  and  2  are  chosen  to  minimize  the  fitting  error  between 
(H£  +  S2)  and  n.  In  the  minimum  variance  unbiased  (MVUB)  section  (case  21  the  noise 
processes  u  and  2  are  assumed  to  be  real  random  vectors  drawn  from  Gaussian  distributions 
with  known  autocorrelations.  The  minimum  mean  squared  error  (MMSE)  section  deals  with 
case  3.  The  last  section  shows  a  detailed  application  example  for  decoding  linear  block  codes 
over  the  complex  field. 

6.1  Least  Squares  Estimation 

In  this  section  we  consider  £  and  2  in  Equation  6.2  to  be  unknown  real  or  complex 
parameters  and  we  choose  estimates  of  them  which  minimize  the  Euclidean  norm  of  the  residual. 
£,  defined  as 

£  =  U-  (H?+S2).  (6.3) 

This  is.  of  course,  equivalent  to  an  ML  estimation  problem  when  the  vectors  are  real  and  v 
is  white  Gaussian  noise.  The  following  orthogonal  projection  operators  will  show  up  in  the 
solution: 


PH  =  H(HWH)'‘H" 

V=I-PH 

(6-11 

Ps  =  S(SwS)'lSw 
ps.  =I-P3. 

There  are  several  ways  to  proceed.  One  way  is  to  estimate  one  of  the  parameters  <9 
or  2)  13  a  function  of  the  other,  and  then  estimate  the  other  parameter  Another  way  is  to 
simultaneously  estimate  both  of  them.  The  same  solution  is  obtained  in  either  -ase.  and  we 
proceed  here  with  the  simultaneous  estimation  by  rewriting  the  model  equation  as 


This  equation  is  in  the  form  of  an  ordinary  linear  least  squares  problem,  for  which  'he  solution. 
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So  the  operator  which  solves  this  least  squares  problem  is  an  oblique  projection. 

Before  going  on  to  the  other  cases  of  the  estimation  problem,  we  give  some  interpre¬ 
tation  and  evaluation  of  the  present  result.  Figure  2.1  in  chapter  II  shows  how  Euclidean  space 
can  be  resolved  into  three  parts.  It  illustrates  the  action  of  EH.S  on  each  component  of  received 
data.  The  range  of  EH.S  is  the  signal  subspace  (H),  and  the  null  space  contains  the  structured 
noise  subspace  (S) .  The  remainder  of  Euclidean  space.  ;  A)  =  (H.  S)  '  .  completes  t  he  null  space. 
This  leads  to  the  interesting  observation  that  the  operator  Ejj;s  entirely  removes  the  structured 
noise,  since  according  to  the  model,  the  structured  noise  lies  in  the  null  space  of  EH:s  At  the 
same  time,  the  signal,  x  =  HO,  is  undisturbed  by  EH:s  since  it  lies  in  the  range. 

While  these  properties  are  highly  desirable,  there  is  a  tradeoff  involved.  The  unstruc¬ 
tured  noise  1/  is  not  dealt  with  as  effectively  by  the  oblique  projection  Eh,s  as  it  would  be  by 
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the  orthogonal  projection  Ph-  In  fact,  while  some  components  of  the  fu1'  rank  noise  v_  will 
be  reduced  or  removed,  certain  components  may  actually  be  amplified  as  shown  in  Figure  6.1. 
This  possibility  implies  that  the  oblique  projection  estimator  is  best  used  when  the  structured 
noise  dominates  the  full  rank  background  noise.  This  claim  is  supported  by  an  observation 
about  the  minimum  variance  unbiased  estimate  in  Section  6.2. 


<s> 


<H> 


Figure  6.1  Possible  effects  of  oblique  projections 


Ill 


Just  how  bad  is  the  effect  of  an  oblique  projection  on  the  background  noise?  How  can 
we  evaluate  a  given  oblique  projection  operator,  EH;s,  in  terms  of  its  effect  on  the  unstructured 
noise?  The  singular  value  decomposition  (SVD)  of  EH;S  plays  an  important  role  in  this  analysis. 
The  SVD  of  Ehjs  is 

E„!S=UEVH,  (6.10) 


where  U  and  V  are  unitary  and  E  is  diagonal  (all  are  n  x  n): 

UU*  =  u"u  =  I 
VVH  =  VHV  =  I 


S  = 


<7l 


(6.11) 


The  diagonal  elements  of  E  are  the  singular  values  of  EHis-  Singular  values  are  always  non¬ 
negative  reals,  and  as  shown  in  Chapter  II  the  singular  values  of  an  oblique  projection  may  take 
on  values  of  0,  1  or  any  value  greater  than  1.  This  is  an  important  distinction  from  orthogonal 
projections,  whose  singular  values  are  all  either  0  or  1. 

The  worst  case  noise  power  gain  is,  of  course,  given  by  the  squared  2-norm  of  the 
operator,  which  is  the  square  of  the  largest  singular  value  of  EH;s-  If  the  noise  vector  u  is 
resolved  onto  the  basis  V  formed  by  the  right  singular  vectors  of  EH;s>  then  each  component 
will  be  multiplied  by  the  corresponding  singular  value.  If  v  has  a  spherically  symmetrical 
distribution,  it  follows  that  the  average  noise  power  gain,  g,  is  the  average  of  all  the  squared 
singular  values  of  EH;s,  of  which  only  k  are  nonzero: 


1  v-'  i 

3ave  ~  „  2-  a'  ' 


i=l 


Oman  =  irf  >  >  •  •  •  >  <TJ  >  0. 


(6.12) 


We  now  give  a  geometric  interpretation  of  this  result.  From  Chapter  II  we  know  that 
the  nonzero  singular  values  of  an  oblique  projection  EH;s  are  related  to  the  principal  angles 
between  its  range  and  null  space.  In  the  present  setting  this  means  that  the  background  noise 
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gain  is  determined  by  the  principal  angles  between  the  signal  subspace  (H)  and  the  structured 
noise  subspace  (S),  with  the  worst  results  when  the  angles  are  small.  With  <Xi  representing  the 
principal  angles  we  repeat  the  relation  from  Chapter  II, 


*  =  — r •  6.13 

sm(ort) 

It  should  come  as  no  surprise  that  the  performance  worsens  when  structured  noise  is  close  to 
signal  by  some  measure,  such  as  the  sine  of  an  angle  between  the  subspaces. 


6.2  Minimum  Variance  Unbiased  Estimation  with  Real  Data 

We  now  return  to  the  model  Equation  6.2  and  assume  for  case  2  that  all  vectors  and 
matrices  are  real  and  the  noise  processes  v  and  £  are  independent  normally  distributed  random 
vectors  with  zero  means  and  known  correlations: 

a. :  R-vi/), 

(6.14) 

S>:;V(Q,R^). 

We  do  not  treat  the  complex  valued  vector  case  here.  In  earlier  Chapters  we  have  treated 
a  complex  vector  with  independent  real  and  imaginary  parts,  each  of  which  was  distributed 
a-8  -V(Q,  <t2I)  ,  as  a  complex  normal  random  vector.  But  we  know  no  definition  for  a  normally 
distributed  complex  random  vector  with  arbitrary  correlation. 

Let  w  —  S &+  v  denote  the  combined  noise.  Then  w  itself  is  normally  distributed. 
The  model  takes  on  the  form  of  the  linear  statistical  model 

H  =  H£  4-  w, 

(6.15) 

w  :  .V(Q,  Rww). 


R-uJtt  —  R-i /v  4  SrR4,®S. 


The  unknown  parameter  9  is  to  be  estimated.  The  ML  estimator  of  9  [SchOl]  is  also 
the  minimum  variance  unbiased  (MVUB)  estimator  of  9.  It  is  given  by 


l=(HrR,;iH)"lHrR;U. 
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The  existence  of  R~^  is  guaranteed  by  the  fact  that  R^  is  positive  definite.  To  establish  this, 
first  note  that  R„„  is  positive  definite  by  the  assumption  that  y_  is  full  rank  noise.  The  matrix 
added  to  R„„  to  form  Ru,^  is  non-negative  definite,  so  the  sum  remains  positive  definite. 

A  signal  estimate  is  obtained  as  in  the  least  squares  section  by  operating  on  9  with  H: 


£=  H£ 
=  E  U 


(6.18) 


where 

E=H(HTR^iH)-1HTR^i;.  (6.19) 


Once  again,  it  involves  an  oblique  projection,  E. 

The  range  of  E  is  the  signal  subspace  H  as  in  the  least  squares  case.  Its  null  space  is 
not  so  easily  characterized,  although  the  following  special  case  gives  some  insight  and  shows  an 
additional  connection  to  the  least  squares  solution.  If  both  noise  processes  are  white  (R,^  =  cr'I 
and  R„„  =  cr;I),  then  the  oblique  projection  E  can  be  written  as 


E  =  H(HT(I  +  rSSr)~1H)-1Hr(I  +  rSST)-1. 


(6.20) 


where 

r  =  cr-/<T l  (6.21) 

When  r  is  zero  there  is  no  structured  noise  and  E  in  Equation  6.20  simplifies  to  PH, 
the  orthogonal  projection  onto  the  signal  subspace.  On  the  other  hand,  when  r  is  large,  the 
structured  noise  is  the  dominant  interference.  In  Chapter  II  we  found  that  in  the  limit  as  r 
goes  to  infinity,  the  quantity  (1+  rSSr)_1  converges  to  Psx-  From  this  it  immediately  follows 
that 

lim  E  =  E„.s.  (6.22) 

r  — oo 

In  other  words,  when  £  and  u  are  white  noise,  the  MVUB  estimator  converges  to  the  least 
squares  estimator  as  the  structured  noise  becomes  dominant. 

In  summary,  the  MVUB  estimator  is  an  oblique  projection  whose  range  is  (HI  and 


114 


whose  null  space  moves  from  (H)1  toward  (S)  as  the  structured  noise  power  increases  from  zero 
toward  infinity.  This  supports  the  earlier  claim  that  the  oblique  projection  obtained  in  the  least 
squares  problem,  whose  nullspace  is  (S),  is  best  suited  to  situations  in  which  the  structured 
noise  dominates  the  unstructured  noise. 

6.3  Minimum  Mean  Squared  Error  Estimation  with  Real  Data 

For  case  3  we  assume  Gaussian  distributions  on  u,  and  £: 

k.  :rV(Q,R«/„), 

SLR**),  (6.23) 

£:*V(Q,  R„). 

We  choose  an  estimator  %  to  minimize  the  mean  squared  error  between  x  and  x.  The  solution 
is  just  a  special  case  of  the  Gauss-Markov  estimator,  with  the  correlation  matrix  Ryy  =  cr;I  + 
SR^S7  +  HRtf«Hr  containing  a  term  accounting  for  structured  noise.  The  solution, 

x  =  RryR  ~ylu  =  HR#«HT(£r3l  +  SR»«ST  +  HR^H7’)"1]/,  (6.24) 

may  be  found,  for  example,  in  [P0088].  We  note  that  x  is  not  an  oblique  projection  of  a  in  this 
case,  nor  in  the  Gauss-Markov  estimator  without  structured  noise.  Also  unlike  the  previous 
results,  this  solution  is  valid  even  if  the  subspaces  (H)  and  (S)  are  overlapping.  This  can  be 
interpreted  to  mean  that  even  with  overlapping  subspaces,  the  information  contained  in  tne 
distribution  functions  allows  some  probabilistic  separation  of  signal  and  structured  noise. 

6.4  Application  to  Decoding  of  Block  Codes 

In  this  section  we  present  an  example  using  the  subspace  identification  and  parameter 
estimation  techniques  we  have  developed.  This  work  was  originally  published  by  Scharf.  Mathvs 
and  Behrens  [SMB87],  Since  then.  Mathvs  has  submitted  a  paper  [Mat90]  for  publication  that 
continues  and  expands  the  work. 

When  the  problem  of  decoding  a  linear  block  code  is  examined  from  the  perspective 
of  estimation  theory  it  is  seen  to  be  equivalent  to  the  problem  of  parameter  estimation  in  the 
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Linear  Statistical  Model.  In  this  section  we  exploit  that  equivalence  to  derive  a  procedure  for 
decoding  linear  block  codes  over  the  real  or  complex  field.  We  develop  a  decoding  procedure 
based  on  a  noise  model  that  includes  large  impulsive  errors  in  a  few  positions  of  the  codeword 
as  well  as  minor  errors  in  all  positions.  The  resulting  decoder  cam  be  represented  as  an  oblique 
projection  operator  determined  by  a  finite  search  algorithm.  Simulation  results  are  given  which 
show  the  decoder’s  performance  in  a  specific  situation. 

Several  authors  have  recently  contributed  to  the  extension  of  results  from  finite-field 
coding  theory  to  the  infinite  field  of  the  real  numbers  and  its  extension  field,  the  complex 
numoers.  Wolf  [Wol83]  has  noted  the  effectiveness  of  error  control  coding  against  impulsive 
errors  in  the  complex  field,  and  has  made  the  important  observation  that  the  error  correction 
capacity  of  such  codes  is  potentially  nearly  twice  what  finite-field  coding  theory  would  lead  one 
to  expect.  Marshall  [Mar85]  has  addressed  the  construction  and  implementation  of  complex 
number  codes  for  impulse  error  correction,  with  attention  given  to  the  dynamic  range  of  the 
elements  in  the  codeword.  Recent  work  in  signal  restoration  has  also  been  applied  to  complex 
number  codes  by  Marshall  [Mar86]. 

Practical  decoding  algorithms  for  impulse  errors  in  complex  number  codes  must  con¬ 
sider  that  the  codeword  may  also  be  subject  to  minor  errors  such  as  roundoff  and/or  background 
channel  noise  in  every  element.  Wolf  [Wol83]  and  Kumaresan  [I<um86]  have  presented  decod¬ 
ing  strategies  which  account  for  such  minor  errors  while  protecting  against  major  impulsive 
errors.  The  decoder  we  develop  in  this  paper  shows  significant  immunity  to  minor  errors  in 
correcting  multiple  impulse  errors.  In  addition,  some  protection  is  provided  against  the  minor 
errors  themselves. 

Block  Codes  and  the  Linear  Statistical  Model.  We  wish  to  recover  the  values 
of  real  or  complex  numbers  that  have  been  coded  and  transmitted  over  a  channel  subject  to 
impulse  noise  and  possibly  also  minor  errors  (background  noise)  in  each  value  transmitted.  By 
the  use  of  appropriate  linear  block  codes,  the  impulse  errors  can  be  detected  and  removed.  In 
the  absence  of  background  noise,  this  removal  can  be  entire  for  sufficiently  few  impulse  errors. 
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One  way  to  specify  or  represent  any  particular  linear  block  code  is  by  its  encoder 
matrix,  which  we  will  call  H.  The  information  values  to  be  sent  are  blocked  into  m-vectors  £, 
and  the  codeword  vector  x  has  length  n  (n  >  m).  Thus  the  dimensions  of  H  are  n  by  m.  The 
operation  of  encoding  is  expressed  by  the  matrix-vector  equation  x  =  H£  : 


’*1  “ 

‘/ill 

'  /*lm 

-  _ 

.  /^nl  ' 

’  hnm  - 

r*i' 

dm. 


(6.25) 


The  elements  Xj,  /iy  and  0;  may  be  real  or  complex,  and  H  is  full  rank  (rank  m)  for  all  uniquely 
decodable  block  codes.  The  received  data  t£  is  the  transmitted  codeword  x  plus  the  channel 


noise  e. 


U=  H0  +  e.  (6.26) 

Equation  6.26  has  the  form  of  the  Linear  Statistical  Model.  Results  based  on  this  connection 
will  be  used  in  the  development  of  the  decoder. 

Decoding  for  Impulse  Noise  Only.  By  impulse  noise  we  refer  to  an  n-vector  6, 
which  has  zeros  in  all  but  t  positions  (l  <  n-m).  Such  a  vector  has  Hamming  Weight  equal  to 
t,  the  number  of  non-zero  entries.  The  nonzero  elements  may  be  large.  In  the  next  subsection 
we  consider  the  combined  effects  of  impulse  noise  and  minor  background  noise,  but  at  present 
we  limit  our  consideration  to  impulse  noise  only.  In  this  case  the  channel  noise  in  Equation 
6.26  is  e  =  b. 

We  wish  to  determine  the  value  of  the  impulse  noise  b  by  an  examination  of  the 
received  data  u,  then  to  subtract  it  from  u  to  obtain  the  transmitted  codeword  x.  To  ensure 
that  the  estimate  £  is  sparce  according  to  the  impulse  noise  model  for  b,  we  represent  b  as  the 
product  of  a  selection  matrix  and  a  t-vector,  as  in  the  structured  noise  model  described  earlier: 

4=S<2.  (6.27) 


The  vector  £  contains  the  (non-zero)  amplitudes  of  the  impulse  errors.  Each  column  of  the 
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n  x  t  selection  matrix  S  corresponds  to  one  impulse  error  and  determines  its  position.  In  any 
column  there  is  a  1  in  the  position  of  that  impulse  error  and  zeros  in  all  other  positions.  The 
order  of  the  columns  is  not  significant;  all  selection  matrices  which  are  column  permutations  of 
one  another  represent  the  same  set  of  impulse  positions  and  are  considered  equivalent. 

We  separate  the  estimation  problem  into  two  parts:  the  estimation  of  S,  which  is  a 
subspace  identification  problem,  and  the  estimation  of  the  parameter  The  subspace  identifi¬ 
cation  problem  is  equivalent  to  determining  the  number  of  impulse  errors  and  their  positions, 
and  is  accomplished  by  a  search  process  through  the  fixed,  finite  space  of  possible  selection 
matrices.  The  parameter  estimation  problem  is  equivalent  to  determining  the  amplitudes  of 
the  impulses,  and  is  accomplished  by  an  oblique  projection. 

VVe  give  here  an  alternate  derivation  of  the  oblique  projection  operator  EHiS  that 
solves  the  estimation  problem,  using  the  coding  theory  concepts  of  parity  check  matrices  and 
syndromes.  One  way  to  split  the  observation  space  Rn  (cr  C")  which  contains  the  data  u  is  into 
two  linear  subspaces:  the  m-dimensional  code  space  spanned  by  H.  and  its  (n  —  m)-dimensional 
orthogonal  complement  (H)x.  We  define  U  as  an  n  by  n  —  m  matrix  which  spans  the  subspace 
Hx.  This  gives  us  the  property  UWH  =  0.  Coding  theorists  may  recognize  U  as  the  parity 
check  matrix  for  the  code,  which  of  course  depends  only  on  H  and  can  be  determined  in  advance. 
The  parity  check  matrix  U  is  not  unique,  and  we  find  it  advantageous  to  use  an  orthonormal 
span  of  the  specified  subspace,  giving  us  the  additional  property  UHU  =  I.  Application  of  the 
parity  check  matrix  U  to  the  received  data  vector  2  produces  the  syndrome  z,  a  vector  with 
length  (n  —  m): 


:  =  \JHn=  UH(H£  +  b) 
=  VHb  =  UHSQ. 


(6.28) 


The  syndrome  is  independent  of  the  codeword  because  UWH  =  0.  What  remains  is 
isomorphic  to  the  projection  of  the  impulse  noise  vector  b  onto  (U),  the  subspace  orthogonal 
to  the  code  space.  The  projection  of  b  onto  (U)  is,  in  fact,  equal  to  U;  =  U XJHb.  We  use  the 
term  syndrome  space  to  refer  to  either  the  ( n  —  m (-dimensional  space  in  which  the  syndrome 


118 


3.  lies,  or  the  isomorphic  subspace  of  Rn  (or  <Cn)  in  which  U£  lies.  The  impulse  noise  vector  b 
may  be  resolved  into  a  component  in  the  code  space  and  a  component  in  the  syndrome  space. 
The  component  in  the  code  space  is  indistinguishable  from  a  legal  codeword,  so  we  turn  to  the 
syndrome  as  the  basis  of  our  estimation  process. 

If  we  assume  S  has  already  been  estimated  as  S  then  the  system  of  equations  repre¬ 
sented  by  £  =  is  overdetermined  (recall  t  <  n  —  m)  and  can  be  solved  in  a  least  squares 

sense  to  obtain  4-  Minimizing  the  usual  least  squares  cost  function  ||z  —  UWS^||2,  the  solution 
[GVL89]  is 

<k=Q  z;  where  Q  =  (STUUHS)_1  §TU.  (6.29) 

Q  is  the  pseudoinverse  of  UWS.  Existence  of  the  necessary  inverse  in  Equation  6.29  is 
related  to  the  relationship  between  the  code  space  and  the  space  spanned  by  S,  in  which  6  lies. 
If  the  two  share  a  common  subspace  of  nonzero  dimension,  then  that  subspace  of  S  is  orthogonal 
to  U  and  the  product  VH  S  will  be  rank  deficient.  The  physical  interpretation  is  that  some 
impulse  noise  vectors  in  the  §  subspace  are  then  legal  codewords.  This  situation  can  be  avoided, 
and  the  existence  of  the  required  inverse  guaranteed,  by  choosing  the  code  appropriately.  For 
example,  codes  constructed  by  application  of  BCH  or  Reed-Solomon  techniques  to  the  field 
of  reals  and  its  extension  field,  the  complex  numbers,  have  codewords  with  Hamming  Weight 
>  n  —  m  (except  the  zero  codeword).  In  this  case  no  pattern  of  n  -  m  or  fewer  impulses  has 
sufficient  Hamming  Weight  to  be  a  legal  codeword. 

We  now  observe  that  in  the  present  case  (with  only  impulse  noise  present)  the  overde¬ 
termined  system  of  equations  in  Equation  6.28  is  actually  consistent  if  the  estimate  of  S  is 
correct.  This  is  simply  because,  according  to  the  model,  the  syndrome  r  originated  from  some 
actual  £  as  UWS£.  The  search  for  S,  then,  is  a  search  for  the  smallest  (minimum  width  ?)  se¬ 
lection  matrix  for  which  Equation  6.28  is  consistent.  Or,  equivalently,  a  search  for  the  smallest 
selection  matrix  for  which  the  syndrome  :  lies  in  the  subspace  spanned  by  UHS. 

After  determining  S.  Equation  6.27  is  used  to  synthesize  b.  the  estimate  of  the  impulse 


Finally,  note  that  under  certain  conditions  the  estimate  2  is  exactly  equal  to  x  in  the 
absence  of  background  noise.  There  are  two  conditions.  First,  the  subspaces  (H)  and  (S)  must 
be  disjoint  so  that  the  inverse  in  Equation  6.30  exists.  Second,  the  smallest  (least  width)  matrix 
S  for  which  Equation  6.29  is  consistent  must  be  unique.  This  uniqueness  can  be  guaranteed 
only  for  appropriate  codes  (such  as  BCH  and  Reed-Solomon)  and  with  t  <  However, 

if  2  arises  from  a  continuous  probability  distribution,  then  the  circumstances  which  lead  to 
non-umqueness  occur  with  probability  zero  for  t  up  to  n  -  m  -  1.  Unique,  'ss  implies  that  the 
search  will  produce  S  =  S.  from  which  it  follows  that 


k  =  Esui^  =  b. 


(6.34) 
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8. 


Figure  6.2  A  Communication  System 

This  implies  that  ?  =  x ;  so  with  probability  one,  the  code  can  correct  t  <  n  —  m  impulse  errors. 

Decoding  for  Impulse  and  Background  Noise.  Now  we  consider  the  case  in 
which  minor  errors  (background  noise)  may  be  present  in  all  codeword  positions  in  addition  to 
the  major  impulse  errors.  For  simplicity  of  analysis  the  background  noise  u  is  assumed  to  be 
white  and  Gaussian:  EU/i/h\  =  cr2I  (£■{  ■ }  is  the  expected  value  operator).  In  the  complex 
case  we  assume  that  the  real  and  imaginary  parts  are  independent  white  Gaussian  random 
vectors  with  variance  ^  in  each  part.  The  effect  of  the  background  noise  on  the  decoding 
process  is  twofold.  It  corrupts  the  codeword  and  perturbs  our  estimate  of  the  impulse  errors. 

We  deal  first  with  estimation  of  the  impulse  errors  in  the  presence  of  the  background 
noise.  Then  we  will  use  whatever  error  correction  capacity  remains  to  reduce  the  background 
noise  effects  on  the  codeword  itself.  We  use  the  same  basic  approach  for  estimating  the  impulse 
errors  as  was  used  without  background  noise,  with  the  channel  noise  e  now  equal  to  b  plus  u. 
The  new  version  of  Equation  6.28  for  the  syndrome  is 
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z  =  VHu=  Vh{H0+S^  +  u) 

=  (UhS)£+  w;  (6.35) 

where  w  —  VHu. 

Observe  now  that  Equation  6.35  is  a  Linear  Statistical  Model  in  the  syndrome  space, 
for  the  syndrome  Furthermore,  with  U  being  orthonormal  (UWU  =  I),  the  new  noise  vector 
w  retains  the  whiteness  property  of  u: 


E{wwj1}  =  E{XJh wh  U}  =  UH  E{i'i/ff  }U 


(6.36) 


=  Uw(<r2I)U  =  cr'XjH  U  =  cr-I. 

As  before  we  determine  S  by  a  search  of  all  selection  matrices,  but  because  of  w  we 
can  no  longer  expect  an  exact  solution.  For  each  S  there  is  an  associated  £,  given  by  Equation 
6.29,  which  minimizes  the  least  squares  cost  function  ||c  -  U^SgH2.  For  a  given  t  (estimated 
number  of  impulses)  we  consider  the  best  S  to  be  the  one  for  which  the  minimized  cost  function 
is  lowest.  This  gives  us  one  S  for  each  candidate  t.  For  these,  the  least  squares  cost  decreases 
as  t  increases,  but  the  amount  of  decrease  tends  to  become  small  for  t  >  t.  This  is  because  the 
t  large  impulse  errors  make  much  greater  contributions  to  fhe  norm  of  the  syndrome  than  the 
smaller  background  noise  errors,  and  thus  tend  to  be  corrected  first. 

Choosing  t  too  small  results  in  uncorrected  impulse  errors,  while  choosing  it  too  large 
inappropriately  treats  some  background  noise  errors  as  impulses.  The  latter  reduces  our  ability 
to  deal  appropriately  with  the  white  background  noise  after  removal  of  the  estimated  impulse 
noise.  We  make  the  final  choice  of  t,  and  thus  of  S.  by  an  order  selection  rule  as  discussed  in 
Chapter  V.  Once  S  and  its  associated  £  are  determined,  the  estimated  impulse  noise  vector  b 
can  then  be  synthesized  according  to  Equation  6.27  and  subtracted  from  the  data: 

£  =  U.  ~  S£ 


(6.37) 


=  E-  v. 

S  :H  **■ 


The  communication  system  so  far  is  almost  identical  to  the  one  for  impulse  noise  only, 
shown  in  Figure  (3.2.  The  only  difference  the  background  noise  brings  is  in  the  selection  criteria 
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for  S.  Unlike  the  case  without  background  noise  however,  the  result  V  is  not  yet  the  codeword 
estimate.  It  does  not  necessarily  lie  in  the  code  space  and  we  can  still  remove  some  of  the 
effects  of  the  background  noise. 

We  now  proceed  to  deal  with  the  effects  of  the  background  noise  on  the  codeword. 
Under  the  assumption  that  the  impulses  have  been  correctly  located  (i.e.,  S  =  S),  E-  has  the 
properties  E-  H£  =  s.  and  E -  H6  =  0.  In  accomplishing  the  elimination  of  the  impulse  noise, 
though,  E|  H  colors  the  background  noise.  The  noise  affecting  £  is  not  white.  In  our  earlier 
paper  [SMB87],  we  derived  the  autocorrelation  matrix  for  £,  pre whitened  £,  and  then  applied  a 
least  squares  estimator  to  determine  ?.  This  process  lead  to  another  oblique  projection  operator 
Eft,  and  the  final  codeword  estimate  was  obtained  by  the  sequence  of  two  oblique  projections 

r  =  ER(I-Ei;H)a.  (6.38) 

As  it  turns  out,  further  simplification  is  possible.  The  product  Eft(I  -  E-  )  is  equal  to 
PH(I  -  E-  H),  and  from  the  three-way  resolution  of  identity  in  Chapter  II  we  can  write 

2  =  PH(I-E..H)a 

=  Pk<P,h;sC+Eh:s^ 

(6.39) 

=  PHE  -v 
"  h:s"* 

=  E  -u. 
his-* 

Therefore,  the  error  correction  decoder  involves  one  oblique  projection  and  a  search  for  the 
correct  selection  matrix.  This  is  exactly  the  result  one  would  expect  from  Section  6.1  on  least 
squares  parameter  estimation  in  structured  noise. 

Simulation  Results.  We  now  present  the  results  of  a  simulation  of  the  system  in 
Figure  6.2.  For  this  example  we  have  used  an  n  =  7  BCH  code  wherein  all  codewords  have 
zeros  in  positions  3,  4,  5  and  6  of  their  DFT.  allowing  m  =  3  information  values.  The  encoder 


matrix  H  for  this  case  is 
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We  fixed  9  and  b  for  50  realizations  of  the  background  noise  u. 


(6.40) 


9  = 


0.5  -  ;0.25 
-0.75  —  JO. 5  , 
0.25  +  >0.75  _ 


6  = 


0 

1.7  —  >1.7 
3.9 -;3. 9 
0 
0 
0 
0 


(6.41) 


The  order  selection  rule  was  adapted  to  the  complex  case  from  [SchS7], 

Signal  to  noise  ratio  (SNR)  is  defined  as  xjn<r‘ ,  where  <r~  is  the  variance  of  each 
element  in  i/  (real  and  imaginary  parts  have  variance  jcr2  each).  But  SNR  is  irrelevant  in  the 
syndrome  space  where  estimation  of  the  impulse  errors  occurs.  The  relevant  parameter  there  is 
the  major  to  minor  noise  ratio  (MMNR)  defined  as  bH b/na2.  We  express  both  in  dB  by  taking 
10  times  the  base- 10  log. 

The  plot  in  Figure  6.3  shows  how  close  the  final  estimates  of  the  information  values 
were  to  their  true  values  9.  The  relative  error  shown  is  -  £|p  'jdip  The  MMNR  used 
represents  quite  substantial  background  noise,  and  in  most  cases  S  was  correctly  determined. 

Conclusions.  We  have  derived  an  algorithm  for  decoding  linear  block  codes  over 
the  complex  field.  First  the  number  and  locations  of  the  impulses  are  determined,  effectively 
identifying  a  structured  noise  subspace.  Then  the  received  data  vector  is  operated  on  by  the 
oblique  projection  operator  whose  range  is  the  code  space  (signal  subspace  I  and  whose  null  space 
contains  the  identified  structured  noise  subspace.  Simulations  verify  that  the  decoder  performs 
well. 


a  Correct  S  (45) 
»  Incorrect  S  (5) 

1  outlier  (off  graph) 


Figure  6.3  Final  Estimates  of  Information  Values. 


CHAPTER  VII 


Conclusions 

We  have  presented  in  this  dissertation  a  collection  of  advancements  in  digital  signal 
processing  theory  and  practice,  based  on  the  use  of  linear  subspaces.  Most  of  the  new  algorithms 
are  related  to  our  proposed  “structured  noise  model” ,  wherein  both  the  signal  and  a  component 
of  the  noise  are  assumed  to  lie  in  low  rank  linear  subspaces.  We  have  argued  that  linear  modeling 
of  noise  is  often  appropriate  for  the  same  reasons  that  linear  modeling  applies  to  signals. 

Oblique  projection  operators  occur  naturally  in  the  solution  of  estimation  problems 
in  structured  noise.  We  consider  the  primary  theme  of  this  dissertation  to  be  the  emphasis 
on  oblique  projection  operators  as  useful  tools  in  signal  processing.  To  support  and  develop 
that  notion  we  have  presented  research  that  can  be  grouped  into  three  areas:  mathematical 
contributions,  subspace  identification  techniques,  and  estimation  problems  in  structured  noise. 
In  the  following  sections  we  discuss  each  of  these  areas  in  terms  of  implications,  limitations, 
and  possible  extensions. 

7.1  Mathematical  Contributions 

Most  of  the  mathematics  used  here  is  not  new.  Since  it  is  mostly  of  a  linear  algebraic 
nature,  it  can  be  found  in  such  books  as  the  classic  Matrix  Computations  by  Golub  and  Van 
Loan  [GVLS9].  But  there  are  a  few  mathematical  results  we  have  not  seen  elsewhere. 

The  equations  in  Chapter  II  for  construction  of  an  oblique  projection  with  a  specified 
range  and  null  space  are  our  own.  although  they  are  of  such  a  fundamental  nature  that  we 
would  not  be  surprised  to  discover  them  in  earlier  works.  These  equations  give  two  different 
expressions  for  the  same  oblique  projection  matrix.  These  expressions  are  essential  to  the  signal 
processing  applications  of  oblique  projections,  both  for  analysis  and  implementation. 
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The  natural  geometrical  interpretation  of  oblique  projections  built  from  our  expres¬ 
sions  is  a  three-way  resolution  of  Euclidean  space  into  the  components  of  signal,  structured 
noise,  and  the  remainder  of  the  space.  This  is  expressed  mathematically  in  our  three-way  reso¬ 
lution  of  the  identity  matrix  as  the  sum  of  two  oblique  projection  matrices  and  one  orthogonal 
projection  matrix. 

We  have  demonstrated  a  link  between  oblique  projections  and  orthogonal  projections 
by  deriving  a  coordinate  transformation  that  makes  an  oblique  projection  problem  into  an 
orthogonal  projection  problem.  This  gives  researchers  the  option  of  studying  the  properties  of 
the  coordinate  transformation  and  combining  them  with  the  known  properties  of  orthogonal 
projections  as  an  indirect  approach  to  the  study  of  oblique  projection  operators. 

The  singular  values  of  an  oblique  projection  operator  are  critically  important  in  ascer¬ 
taining  the  effect  of  that  operator  on  unstructured  background  noise.  We  have  pointed  out  that, 
unlike  an  orthogonal  projection  operator,  an  oblique  projection  may  have  singular  values  larger 
than  unity.  If  any  of  those  singular  values  are  too  large,  the  benefit  gained  from  elimination 
of  the  structured  noise  may  be  outweighed  by  amplification  of  the  background  noise.  We  have 
quantified  that  relationship  in  Chapter  VI.  We  have  als<->  discovered  a  geometric  interpretation 
of  the  singular  values  of  an  oblique  projection,  giving  a  relationship  in  Chapter  II  that  connects 
them  directly  to  the  principal  angles  between  the  range  and  the  null  space. 

The  main  limitation  of  the  mathematical  results  on  oblique  projections  is  simply 
that  the  range  (signal  subspace)  and  null  space  (structured  noise  subspace  plus  remaining 
space)  must  be  disjoint.  The  intersection  between  range  and  null  space  must  include  no  vector 
other  than  the  zero  vector.  In  a  signal  processing  context  this  means  that  we  cannot  build  a 
projection  operator  to  completely  separate  signal  and  noise  when  some  structured  noise  vectors 
are  identical  to  some  signal  vectors. 

7.2  Subspace  Identification  Techniques 

Before  one  can  build  an  oblique  projection  to  separate  signals  from  structured  noise, 
one  must  know  the  signal  subspace  and  the  structured  noise  subspace.  While  it  may  occasionally 
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be  possible  to  determine  both  subspaces  in  advance  through  theoretical  considerations,  it  is  very 
important  in  many  cases  to  be  able  to  identify  these  subspaces  from  observed  data  and  limited 
knowledge  of  the  nature  of  the  signals. 

The  principle  of  Maximum  Likelihood  is  an  old  standby  for  estimation  problems.  We 
have  adopted  it  in  many  of  our  subspace  identification  techniques.  However,  we  recognize  that 
it  is  not  always  an  appropriate  way  to  solve  estimation  problems.  In  Chapter  III  we  give  a 
critical  evaluation  of  the  Maximum  Likelihood  principle  and  an  example  using  a  quadratic 
equation  to  illustrate  the  pitfalls  of  blind  application  of  ML.  Our  purposes  in  including  this 
evaluation  of  ML  are  to  warn  the  reader  against  blind  application  of  our  subspace  identification 
techniques,  to  suggest  the  adaptation  of  our  techniques  to  some  kind  of  MAP  estimation  rule 
where  appropriate,  and  to  make  a  statement  of  our  reservations  about  the  careless  use  of  ML. 

The  easiest  class  of  subspace  identification  problems  is  signal  subspace  identification 
without  structured  noise.  In  this  class  we  have  presented  improvements  on  existing  algorithms 
for  the  nonparametric  problem  whose  solution  is  obtained  with  the  SVD,  and  for  the  parametric 
problem  with  complex  exponential  signal  modes.  The  latter  problem  frequently  needs  to  be 
solved  under  constraints  regarding  the  location  of  the  roots,  and  we  have  presented  an  extension 
of  the  “KiSS”  algorithm  that  incorporates  constraints.  We  have  also  found  elegant  expressions 
and  interpretations  for  the  gradient  and  Hessian  of  the  KiSS  objective  function  for  complex  data 
and  parameters.  A  Newton  algorithm  can  be  implemented  using  these  gradient  and  Hessian 
expressions. 

The  next  most  difficult  subspace  identification  problem  is  that  of  identifying  a  struc¬ 
tured  noise  subspace  when  you  know  vour  signal  subspace.  The  coding  example  of  Chapter  VI 
falls  into  this  category.  The  "‘KiSS  with  structured  noise-’  algorithm  of  Chapter  IV  solves  an 
equivalent  problem,  since  it  identifies  a  signal  subspace  when  the  structured  noise  subspace  is 
known. 

The  most  difficult  problem  is  the  simultaneous  identification  of  both  signal  and  struc¬ 
tured  noise  subspaces.  It  is  clear  that  this  cannot  be  done  without  some  prior  knowledge  that 
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will  allow  us  to  distinguish  between  the  two.  We  have  approached  this  problem  by  assuming 
the  signal  is  a  sum  of  complex  exponentials  and  the  structured  noise  is  impulsive.  Under  these 
conditions  the  KiSS  algorithm  with  structured  noise  can  be  used  within  a  combinatorial  search 
to  locate  the  noise  impulses.  The  results  have  been  rather  unsatisfactory. 

A  problem  similar  to  subspace  identification  is  that  of  updating  existing  signal  and/or 
noise  subspaces  based  on  new  data.  For  signal  subspace  updates  without  structured  noise  we 
have  presented  in  Chapter  III  a  technique  using  Total  Least  Squares.  We  have  then  derived  an 
extension  of  that  technique  that  allows  simultaneous  updates  of  both  signal  and  noise  subspaces. 
This  technique  is  suitable  for  adaptively  identifying  slowly  varying  subspaces. 

In  Chapter  V  we  have  address  the  issue  of  order  selection  when  identifying  subspaces. 
We  have  proposed  a  new  order  selection  rule  for  rank  reduction  in  the  Linear  Statistical  Model 
without  structured  noise  based  on  two  sequential  applications  of  the  principle  of  Maximum 
Likelihood.  The  new  rule  performs  only  slightly  better  than  existing  rules  and  is  rather  ex¬ 
pensive  computationally.  The  most  intriguing  aspect  of  this  result  is  the  concept  of  multiple 
applications  of  ML. 

We  have  also  derived  in  Chapter  V  a  Bayes  hypothesis  test  for  order  selection  in  the 
identification  of  structured  noise  subspaces.  Even  with  its  solid  theoretical  foundation,  the 
Bayes  rule  performs  only  a  little  better  than  existing  rules  that  are  rather  more  ad  hoc. 

7.3  Estimation  Problems  in  Structured  Noise 

We  have  derived  signal  estimators  and  parameter  estimators  for  several  structured 
noise  problems,  distinguished  by  the  amount  of  statistical  information  available  about  the 
underlying  processes.  In  the  first  case,  without  probability  densities  on  the  signal  or  noise,  it 
was  found  that  the  least  squares  estimator  of  a  signal  in  structured  noise  is  the  oblique  projection 
of  the  received  data  vector  onto  the  signal  subspace  with  the  structured  noise  subspace  in  the 
null  space  of  the  projection. 

When  a  Gaussian  density  function  is  placed  on  the  parameter  vector  of  the  structured 
noise  term  in  the  model  the  solution  still  involves  an  oblique  projection  whose  range  is  the 


129 


signal  subspace.  The  null  space  in  this  case  is  not  perfectly  aligned  with  the  structured  noise, 
but  tends  toward  the  structured  noise  subspace  when  the  structured  noise  dominates  the  back¬ 
ground  noise.  On  the  other  hand,  as  the  relative  structured  noise  power  decreases,  the  oblique 
projection  converges  to  the  orthogonal  projection  onto  the  signal  subspace. 

Finally  we  have  given  an  example  application  in  the  decoding  of  linear  block  codes 
over  the  Real/Complex  number  field.  This  application  illustrates  both  subspace  identification 
and  parameter  estimation  aspects  of  structured  noise  processing. 

7.4  Extensions 

When  identifying  impulse  noise  subspaces,  the  combinatorial  search  for  the  best  se¬ 
lection  matrix  can  become  impractical  for  large  data  lengths.  Further  study  is  warranted  on 
ways  to  reduce  the  search  space.  For  example,  with  BCH  codes  the  Prony  type  method  of 
Kumaresan,  Tufts,  and  Scharf  [KTS84]  may  be  used  to  obtain  a  starting  point  for  the  search. 

The  simultaneous  identification  of  signal  and  structured  noise  subspaces  has  proven 
to  be  a  difficult  problem,  and  our  current  approaches  are  not  adequate.  We  explain  in  Chapter 
IV  that  there  appears  to  be  a  connection  between  the  subjective  smoothness  of  the  IviSS 
objective  function  and  the  correctness  of  a  candidate  structured  noise  subspace.  We  suggest 
future  research  to  quantify  and  exploit  this  connection  and  try  to  develop  a  better  method  of 
simultaneous  subspace  identification. 

Signal  detection  problems  in  structured  noise  remain  open  as  a  possible  extension 
of  this  research.  The  coordinate  transformation  relating  oblique  and  orthogonal  projections 
should  be  of  use  here  in  specifying  invariance  conditions  for  the  detector.  For  example  the 
group  of  transformations  which  characterize  invariance  to  structured  noise  and  to  rotation  of 
the  signal  within  the  signal  subspace  is  given  by 

3(SL)  =  F-1  [(UhRhUh"  +  Phx)Fjz+  UHx^]  ,  (7.1) 

where  F  is  the  coordinate  transformation  of  Chapter  II,  RH  is  any  rotation  in  the  signal 
subspace.  UH  is  an  orthogonal  span  for  the  signal  subspace  (H).  UHx  is  an  orthogonal  span  of 
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(H)x,  and  &  is  any  (n  —  m)-vector.  The  maximal  invariant  statistic  for  detecting  an  unknown 
signal  in  a  known  signal  subspace,  with  known  noise  variance,  under  these  invariance  conditions 
is 

U.HEh>sH Eh.sU 

a  = - 5 - ■  (<-2) 

no1 

Detection  statistics  could  be  derived  for  other  cases  of  known  versus  unknown  signal,  noise 
variance,  and  structured  noise. 

The  ML  esitmators  of  noise  variance  in  Sections  3.2  and  3.4  provide  an  interesting 
area  for  further  study.  As  mentioned  at  the  end  of  Section  3.2,  the  normalization  by  n  would 
seem  to  be  better  replaced  by  a  normalization  by  n  —  r,  since  we  have  noise  power  in  only 
n  —  r  dimensions  available  for  estimating  <j~  .  A  study  of  the  mean  and  variance  of  these  ML 
estimators  of  <j~  may  reveal  a  correctable  bias.  The  difficulty  is  in  finding  the  expected  value 
of  a  singular  value  of  a  matrix.  The  way  around  this  difficulty  may  be  in  finding  an  alternate 
expression  for  the  estimator  that  does  not  involve  singular  values. 

The  possible  role  of  oblique  projections  in  spectral  estimation  should  be  examined. 
This  would  include  investigation  of  quadratic  forms  in  oblique  projections,  and  possible  fre¬ 
quency  domain  analogs  of  oblique  projections. 
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APPENDIX  A 


MATLAB  Code  for  the  KiSS  Algorithm 


The  following  code  implements  the  KiSS  algorithm  with  constraints  and  optionally 

with  structured  noise.  The  main  function  KISS  uses  subfunctions  AAINV,  BUILDB,  KSS- 

DATA,  MOD,  and  ORTHPROJ  also  listed  here.  For  phase  2,  the  KISS  function  can  use 

UMSOLVE,  available  through  the  MATLAB  user  group. 

function  [a, b, armors, thpath.T.c , iterl ,iter2, termcode ,phla,ermorml]  »  .. 
kiss(y .m.constr ,ph2,toll , tol2,usecirc , aO ,S, useglob, f debug) 

X  [a, b.errnorm, thpath.T.c, iterl, iter2, termcode]  •  .. 

X  kiss (y , a , constr , ph2 , t ol 1 , t  ol2 , usecir c , aO , S , useglob , f debug) 

7. 

X  This  function  implements  the  KiSS  algorithm  for  estimation  of 

X  parameters  of  superimposed  sinusoids.  Inputs: 

%  y  -  observed  data  vector  (required) . 

X  m  -  model  order  (required) . 

X  constr  -  Constraint  option: 

X  [1]  »  Nontriviality  constraint  only. 

X  2  «  conjugate  symmetry  constraint . 

X  3  »  Real  coefficients. 

X  4  ■  Real  and  symmetric. 

X  S  ■  conjugate  symmetry  plus  "project  to  circle"  contraints. 

X  6  »  real  symmetry  plus  "project  to  circle"  constraints. 

X  ph2  -  Phase  2  option: 

X  [0]  *  No  phase  2. 

X  1  ■  Phase  2  by  constrained  Evans/Fishi  method,  (is  it  right?) 

X  2  »  Phase  2  by  Newton’s  method. 

X  toll  -  Phase  1  convergence  tolerance  (def ault»eps- (1/3)) . 

X  tol2  -  Phase  2  convergence  tolerance  (default=sqrt(eps)) . 

X  usecirc  -  permission  to  use  circulant  technique  for  inverse. 

X  aO  -  starting  point  for  AR  (denominator)  parameters. 

X  S  -  structured  noise  subspace. 

X  useglob  -  flag  indicating  that  the  following  global  variables 
X  have  been  declared  before  calling  kiss: 

X  kissglobY.kissgloby .kissglobT, kiss globe .kissglobYa, 

X  kissglobQ .kissglobA .kissglobu.kissglobnhat 

X  Doing  this  saves  much  computation  for  Newton  phase  2. 

X  f debug  -  Debug  mode  flag.  Show  progress  when  set. 

X 

X  Outputs: 

X  a  -  AR  (denominator)  parameters. 


7.  b  -  HA  (numerator)  parameters . 

7.  errnora  -  fitting  error  norm. 

thpath  -  sequence  of  iterates  of  theta. 

7.  T,c  -  constraint  model,  a  »  T«theta+c. 

%  iterl,iter2  -  number  of  iterations  in  each  phase 
'/.  termcode  -  from  the  Heston  phase  2  (see  umsolve) 
'/.  phla  -  AR  parameters  after  phase  1. 

7,  errnorml  -  fitting  error  norm  after  phase  1. 

7. 

7.  Richard  T.  Behrens,  May  1990,  August  1990. 

7. 

if  (margin  <  11) 
f debug  ■  0; 
end 

if  (nargin  <  2) 
clc 

snr  “  input( ’Enter  snr  for  kssdata:  ’); 

[y, sigma]  »  kssdata(snr) ; 
n  -  2S; 

m  »  input( ’Enter  model  order:  ’); 

f debug  *  2; 

else 

[n,k]  ■  size(y) ; 
if  (k  >  1) 
if  (n  >  1) 

error( 'First  input  argument  must  be  a  vector.’: 
end 

y  »  y. ’ ; 

n  »  k; 

end 

end 

if  (mod(m,2)  *«  1) 
isodd  »  1 ; 
q  ”  (m-l)/2; 
else 

isodd  »  0; 
q  -  (m-2)/2; 
end 

if  (fdebug»«2) 

constr  »  inputOEnter  constraint  option:  ’); 

ph2  *  input( ’Enter  phase  2  option:  ’); 

toll  *  input(’Enter  phase  1  tolerance:  ’); 

tol2  »  input( ’Enter  phase  2  tolerance:  ’); 

usecirc  »  input (’Permission  to  use  circulant  inverse 

S  »  input ( ’Structured  noise  subspace:  ’); 

useglob  *  0; 

f debug* 1 ; 

else 

if  (nargin<3) ,  constr»G  ;  ead 
if  (nargin<4) ,  ph2  *  □  ;  end 
if  (nargin<5) ,  toll  ■  [] ;  end 
if  (nargin<6) ,  tol2  »  G ;  end 
if  (nargin<7) ,  usecirc  «  [] ;  end 
if  (nargin<8) ,  aO  »  [] ;  end 


if  (nargin<9)  ,  S  «  []  ;  end 

if  (nargin<10),  useglob  »  □;  end 

end 

7. 

I,  Set  up  defaults. 

7. 

if  isempty(constr)  ,  constr-l;  end 
if  isempty(ph2) ,  ph2  »  0;  end 
if  iseapty(toll) ,  toll  »  epa'd/3);  end 
if  isempty(tol2) ,  tol2  »  sqrt(eps);  end 
if  isempty(usecirc) , 

uaecirc  *  ~(( (constr“«4)  ]  (constr*"6) )  t  (mod(n,2)”l)  &  i3odd)  ; 
usecirc  »  usecirc  k  (n  >  6*a);  '!.  An  approximation  of  when  it  pay3. 
uaecirc  ■  0;  %  Override  and  disable  circulant  inverses, 
end 

if  isempty (useglob),  useglob  *  0;  end 
if  ((~  isempty  (S)  )fc(ph2~“0) ) 

dispC’Cannot  use  phase  2  with  structured  noise.’) 

ph2  -  0; 

end 

Y  “  toeplitzCyC (m+1) :n) ,y ( (a+l) :-l : 1) ) ; 
if  (constr*»l) 

T  »  [zeros ( 1, 2*m) ;  eye(m)  j*eye(m)]; 

c  »  [1 ;  zeros(m,  D]  ; 

end 

if  ((constr«»2)|(constr—5)) 
if  isodd 

T  ■  [[zerosCl, q):  eye(q);  jay(q) ;  zeros(l,q)]  [j*eye(q+l) ;  -j*jay(q+l)]] 

c  »  [1 ;  zeros(2*q, 1) ;  1] ; 

else 

T  »  [[zerosCl ,q+l) ;  eye(q+l);  jay(q)  zerosCq.l);  zerosC 1 ,q+l )]  .. 

[j«eye(q+l);  zerosCl ,q+l) ;  -j«jay(q+l)]] : 

c  »  [1;  zeros (2*q+’ , 1) ;  1]; 

end 

end 

if  (constr*“3) 

T  »  [zero8(l,m);  eye(a)]; 
c  »  Cl;  zeros (a, 1)] ; 
end 

if  C(constr*«,4)|(constr»»6)) 
if  isodd 

T  ■  [zeroaCl.q, ;  eve(q)  ;  jay(q);  zerosCl, q)]; 
c  ■  [1 ;  zaros(2*q, 1) ;  1] ; 

else 

T  ■  [zerosCl, q+1) ;  eye(q+l);  jay(q)  zeros(q.l);  zerosCl , q+1)] ; 

c  »  [1;  zeros (2»q+l , l) ;  1] ; 

end 

eul 

Ytilde  »  Y«T; 

[k.q]  *  3ize(T)  ;  7.  Hereafter,  q  is  the  length  of  theta, 

if  lsempty(aO)  7.  start  with  Prony  Cor  constrained  Prony). 

Q  -  Y’«Y; 

theta  »  -real(T’«Q*T)real(T’»Q*c) ; 
else 


theta  »  pinv(T)*(aO-c) ; 
if  any(~  near(T*theta+c,aO) ) 

errasgC 1  initial  paraaeter  aO  did  not  satisfy  constraints’) 

end 

end 

if  (constr>»5) 

theta  ■  projroot(theta,T,c) ; 
end 

thpath  •  theta; 
iterl«0;  iter2»0; 
if  fdebug 

dispC’Ready  to  start  phase  1  iterations.’) 

keyboard 

e.*d 

if  (toll>0) 
old  ■  zeros (q, 1) ; 

while  ((~  all(near(theta,c!d.toll)))fe(iterl<100))  '/,  phase  1  iteration 

if  fdebug 

clc 

disp([’Phase  1,  iteration  ’  nun2str(iter 1)  ’.’]) 
dispO theta  ■  ’) 
theta 
pause (5) 
cl6 

plot (thpath1) 
pause (S) 
end 

old  ■  theta; 
if  isempty (S) 

q  »  Y’*aainv(theta,T,c,usecirc,n)*Y; 
else 

[aai.A]  »  aainv(theta,T,c .usecirc ,n) ; 
ap  ■  aai*A’;  aps  «  ap*S; 
q  ■  Y’*(aai  -  aps*inv(S ’*A*aps) *aps ’ )*Y; 
end 

theta  »  -real (T’»q*T) real (T ’ »q*c) ; 

-hpath  »  Cthpath  theta] ; 
iterl  ■  iterl  +  1; 
end 
end 

phla  »  T*theta+c; 

A  »  buildb(phla.n) ; 
e  »  orthproj(A)*y; 
ermornl  »  norm(e) ; 
if  (ph2»»l) 
old  »  zeros(q.l)  ; 

while  (~  all(near(theta,old,tol2)))  ’/,  phase  2  iteration 

if  fdebug 

clc 

disp([’Phase  2,  iteration  ’  nun2str(iter2)  ’.’]) 
disp( ’theta  *  ’ ) 
theta 
pause (5) 


plot(thpath’) 
pause (5) 
end 

old  *  theta;  7,  (not  sure  if  phase  2  is  correct 

[Q,A]  *  aainv(theta,T,c,usecirc,n);  7.  for  T  other  than  idendity) . 

W  -  A*Q; 
d  -  A’*y; 

L  ■  zeros(n, 1) ; 
for  k”l :» 

dak  ■  zeros(m+l , 1) ;  dak(k+l)  •  i; 
dAk  ■  buildb(dak.n) ; 

dUk  -  (dAk  -  W*(dAk’*A  +  A’*dAk))*q; 

L( : ,k+l)  -  dtf k*d ; 
end 

U  »(L’  +  Y’*W>)*W; 
q  -  U*Y; 

theta  ■  -real (T’*q*T) real (T ’ *q*c) ; 

thpath  «  [thpath  theta]  ; 

iter2  *  iter2  ♦  1; 

end 

end 

if  (ph2»«*2)  7.  Phase  2  by  Newton’ s  method, 

fparam  •  [Y ;  T.’;  c.’;  usecirc  n  useglob  zeros  ( 1  ,m-2)] ; 
details  *  zeros (17,1); 

if  fdebug,  details (1)  »  1;  end  X  (print  trace) 
details(2)“2;  7.  (use  the  hookstep) 
details(3)  *  1;  7.  (don’t  use  secant  update) 
details(4)  ■  1;  1  (use  analytic  gradient) 
details(16)  *  1;  7.  (scale  by  starting  point) 
details(l7)  *  1;  %  (use  analytic  hessian) 

[theta .termcode, path]  ■  .. 

umsolve( ’kissf ’ , theta .details , fparao, ’kissg ’ , ’kissh’ ) ; 
iter2  »  length(path)-2; 

thpath  ■  [thpath  path(2:  (length(path) -1) , : ) ’]  ; 
if  ((termcode»«2)|(teracode»«3)) 

Because  the  desired  minimum  is  rather  deep  and  narrow  we  sometimes 
%  need  to  ease  up  on  gradtol  (so  the  gradient  needn’t  get  so  small) , 

7.  and  on  steptol  (so  we  can  take  very  tiny  steps). 

7.  disp( ’Tryinging  harder  to  converge.’) 

details(4)  *  0;  7.  (don’t  use  analytic  gradient  -  for  kicks) 

details(8)  »  eps-(l/3);  7.  (ease  up  on  gradtol) 

detailsO)  »  2*eps;  7.  (steptol — allow  very  tiny  steps) 

details(l7)  »  0;  X  (don’t  use  analytic  hessian  -  for  kicks) 

[theta, termcode, path]  »  .. 

umsolve( ’kissf ’ , theta, details, fparam, ’kissg ’ , ’kissh’ ) ; 

iter2  ■  iter2  +  length(path)-2; 

thpath  «  [thpath  path(2:  (length(path)-l) , : )  ’]  ; 

end 

end 

if  fdebug 
plot (thpath ’ ) 

title( ’Final  KiSS  convergence  path.’) 
pause 
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dg 

unitcirc 
hold  on 

plot(roots(T*theta+c) , ’x' ) 

title( ’Final  KiSS  root  positions’) 

pause (S) 

hold  off 

axis< ’ normal ’ ) 

end 

l 

X  Compute  the  coefficients  and  the  final  error 

X 

a  ■  T*theta+c; 

A  »  buildb(a.n) ; 
e  ■  orthproj (A)*y; 
h  ■  y  -  e; 

HI  -  toeplitz(h(l:m) ,  Qi(l) ;  zeros (m,  D] ) ; 
b  -  Hl*a; 

ermorm  ■  norm(e); 
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function  [Q,A]  »  aainv(theta,T,c,usecirc,n) 

X  [Q.A]  *  aainv(theta,T,c,usecirc,n)  forms  the  toeplitz  coefficient  matrix 
X  A  from  the  prediction  polynomial  a«T*theta+c.  Then  Q.  the  inverse  of 
X  A’»A,  is  computed  either  directly  (if  usecirc»0)  or  by  a  fast  algorithm 
X  using  a  circulant  matrix  (if  usecirc«l). 

X 

X 

X  Richard  T.  Behrens,  May,  1990. 

X 

[m,q]  -  size(T) ; 

m  *  m  -  1; 

A  »  buildb(T*theta+c,n) ; 

[n,k]-size(A) ; 
if  usecirc 
v  -  A(: .1) ’*A; 

c  ■  v  +  [0  conj (v(k:-l:2) )] ; 

Cinv  “  circinv(c); 
qq  ■  v((m+l) : — 1 : 2) ; 

U  ■  [[eye(m);  zeros(k-m.m)]  .. 

[zeros (k-m,m) ;  hankel ( [zeros (m-1, 1) ;  qq(l)],  qq)]] ; 

VC  «  [[zeros(a,n-2*a)  toeplitz([conj(qq(l)) ;  zeros (m-1, 1)] ,  .. 
conj(qq))];  [rot90(eye(m) )  zeros(m,k-m)]]*Cinv; 
q  >•  Cinv  +  Cinv*U/(eye(2*m)  -  VC* U)*VC; 
else 

Q  »  inv(A’*A) ; 
end 


X  X 


142 


function  B  *  buildb(b.n) 

•/.  B  -  buildb(b,n) 

This  function  builds  the  toeplitz  matrix  B  from  the  coefficients  b. 

The  result  B  spans  the  perp-space  to  a  signal  built  from  posers  of 
X  the  roots  of  the  polynomial  b.  The  size  of  3  is  n  by  u-m  shere  m 

X  is  the  degree  of  polynomial  b. 

x 

%  This  B  is  as  defined  by  Bresler  ft  Macovski,  and  is  the  hermetian 
X  transpose  of  the  B  defined  by  Kumaresan,  Scharf  and  Shas. 

% 

m  »length(b)-l; 

B  "toeplitz (Cflipud(conj(b)) ;  zeros(n-m-l,  D]  ,  [conj(b(m+l) ) ;  zeros(n-m-l ,  D] ) ; 
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function  [y, nnorm]  *  kssdata(snr, fixed, n> 

X 

‘/.  [y, sigma]  »  KSSDATA (SNR, fixed, n)  for  fixed  »  0,  or 
X  [y, nnorm]  »  KSSDATA(SNR, fixed ,n)  for  fixed  »  1 

%  generates  a  signal  x(i)  according  to  the  model  given  in  the  Kumaresan, 

X  Scharf  and  Sham  paper  (same  data  as  Tufts  k  Kumaresam) .  It  generates 
X  samples  for  i  ■  0:(n-l)  of 
X 

’l  x(i)  ■  exp(j*omegal*i)  +  exp(j*pi/4)*exp(j*omega2*i) 

X 

X  oith  omegal  »  2*pi*0.52 
X  omega2  •  2*pi*0.S0 
X 

X  Noise  is  added  to  obtain  y(i),  oith  any  desired  input  SNR.  If  fixed«0 
X  the  noise  is  gaussian  with  the  right  EXPECTED  snr,  if  fixed*l  the  noise 
’/,  is  sperically  symmetrical  with  the  right  EXACT  snr. 

X 

if  (nargin  <  2) 
fixed  «  0; 
end 

if  (nargin  <  3) 

n  *  25; 

end 

omegal  -  2*pi*0.S2; 

omega2  »  2*pi*0.50; 

xl  -  exp(j*omegal*(0: (n-1) ) ’) : 

x2  *  exp(j*pi/4)*exp(j»omega2*(0: (n-l>) ’) ; 

x  »  xl  +  x2; 

if  (nargin  <  1) 

snr  *  input( 'Enter  desired  input  signal  to  noise  ratio:  ’); 
end 

if  fixed 

nnorm  »  sqrt ( (xl ’*xl) /(10' (snr/10) ) ) ; 
else 

sigma  *  1/sqrt (2*10- (snr/10) ) ; 
end 

rand( ’normal’) 

noise  ■  rand(n,l)  +  j*rand(n,l); 
if  fixed 

noise  *  (nnorm/norm(noise) )*noise; 
else 

noise  *  sigma*noise; 
end 

y  *  x  +  noise; 
if  (~  fixed) 
nnorm  ■  sigma; 
end 


'/,  return  sigma  instead. 
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theta*theta; function  [m]  »  mod(x.n) 

^Computes  x  modulo  n.  If  x  or  n  is  not  an  integer ,  it  is  first  rounded. 

r. 

‘/.Mail  H.  Endsley  4/87. 

X 

y  -  round(x) ; 
k  ■  round(n) ; 

*  -  y  -  k*fix(y/k) ; 
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function  [p,k]  *  orthproj(h) 

x 

X  [P,K]  -  ORTHPROJ(H) 

X  P  -  ORTHPRQJ  (H) 

X 

X  Returns  the  orthogonal  projection  matrix  P  whose  range  is  identical  to 
X  the  range  of  H.  Optionally  returns  K,  the  rank  of  P.  A  projection  is 
X  eidempotent  and  hermetian  symmetric  (i.e.  P*P»P  and  P’«P).  If  H  is 
X  full  rank  the  projection  is  computed  directly  (without  factorization). 
X  The  SVD  is  used  if  H  is  rank  deficient. 

X 

X  Richard  T.  Behrens  Nay  22,  1987. 

X 

k»rank(h) ; 

[m,n]*size(h) ; 
if  k»“n 

p»(h/(h’«h))*h’ ; 
else 

[u,s,v]  »  svd(h,0) ; 

p  »  u(: , l:k)  *  u(: ,l:k) ’ ; 

end 


